From b299417b88c6a6a2118382e5609fa94bbe2764d7 Mon Sep 17 00:00:00 2001
From: Clemens Neudecker <952378+cneud@users.noreply.github.com>
Date: Sat, 14 Mar 2020 04:58:27 +0100
Subject: [PATCH] Update README.md

---
 README.md | 58 ++++++++++++++++++-------------------------------------
 1 file changed, 19 insertions(+), 39 deletions(-)

diff --git a/README.md b/README.md
index 5694fb4..90fbe74 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,4 @@
 # neat: named entity annotation tool
-#### version 0.1
 ---
 ![Screenshot](https://user-images.githubusercontent.com/952378/72167036-ad2c6680-33ca-11ea-980f-5859e5155877.png)
 ---
@@ -17,22 +16,20 @@
     
 &nbsp;&nbsp;&nbsp;[2.4 Data preparation](https://github.com/qurator-spk/neat/blob/master/README.md#24-data-preparation)
     
-&nbsp;&nbsp;&nbsp;[2.5 Provenance](https://github.com/qurator-spk/neat/blob/master/README.md#25-provenance)
+&nbsp;&nbsp;&nbsp;[2.5 Keyboard navigation](https://github.com/qurator-spk/neat/blob/master/README.md#25-keyboard-navigation)
     
-&nbsp;&nbsp;&nbsp;[2.6 Keyboard navigation](https://github.com/qurator-spk/neat/blob/master/README.md#26-keyboard-navigation)
+&nbsp;&nbsp;&nbsp;[2.6 Mouse navigation](https://github.com/qurator-spk/neat/blob/master/README.md#26-mouse-navigation)
     
-&nbsp;&nbsp;&nbsp;[2.7 Mouse navigation](https://github.com/qurator-spk/neat/blob/master/README.md#27-mouse-navigation)
+&nbsp;&nbsp;&nbsp;[2.7 Image support](https://github.com/qurator-spk/neat/blob/master/README.md#27-image-support)
     
-&nbsp;&nbsp;&nbsp;[2.8 Image support](https://github.com/qurator-spk/neat/blob/master/README.md#28-image-support)
-    
-&nbsp;&nbsp;&nbsp;[2.9 Saving progress](https://github.com/qurator-spk/neat/blob/master/README.md#29-saving-progress)
+&nbsp;&nbsp;&nbsp;[2.8 Saving progress](https://github.com/qurator-spk/neat/blob/master/README.md#28-saving-progress)
 
 [3. Annotation Guidelines](https://github.com/qurator-spk/neat/blob/master/README.md#3-annotation-guidelines)
 
 ### 1. Introduction
-[neat](https://github.com/qurator-spk/neat) is a simple, browser-based tool for editing and annotating text with named entities to produce a corpus for training/testing/evaluation. It can be used to add or correct named entity BIO-tags in a TSV file and to correct the token text or tokenization (e.g. due to OCR/segmentation errors). 
+[neat](https://github.com/qurator-spk/neat) is a simple, browser-based tool for editing and annotating text with named entities to produce a dataset for training/testing/evaluation. It can be used to add or correct named entity BIO-tags in a TSV file and to correct the token text or tokenization (e.g. due to OCR/segmentation errors). 
 
-[neat](https://github.com/qurator-spk/neat) is developed at the [Berlin State Library](https://staatsbibliothek-berlin.de/) for data annotation in the context of the [SoNAR-IDH](https://sonar.fh-potsdam.de/) project and the [QURATOR](https://qurator.ai/) project.
+[neat](https://github.com/qurator-spk/neat) is developed at the [Berlin State Library](https://staatsbibliothek-berlin.de/) for data annotation in the [SoNAR-IDH](https://sonar.fh-potsdam.de/) project and the [QURATOR](https://qurator.ai/) project.
 
 ### 2. User Guide
 
@@ -40,7 +37,7 @@
 [neat](https://github.com/qurator-spk/neat) runs locally as a pure HTML+JavaScript webpage in your web browser. No software needs to be installed, but JavaScript has to be enabled in the browser. 
 
 #### 2.2. Installation
-Simply clone the repo using ``git clone https://github.com/qurator-spk/neat.git`` or download the [ZIP](https://github.com/qurator-spk/neat/archive/master.zip). Make sure you have at minimum ``neat.html`` and ``neat.js`` residing in a local directory, then it is sufficient to just open ``neat.html`` in a browser. Any fairly recent browser should work, but only Chrome and Firefox are tested.
+Clone the repo using ``git clone https://github.com/qurator-spk/neat.git`` or download the [ZIP](https://github.com/qurator-spk/neat/archive/master.zip). Make sure ``neat.html`` and ``neat.js`` are in the same directory, then open ``neat.html`` in a browser. Any fairly recent browser should work, but only Chrome and Firefox have been tested.
 
 #### 2.3 Data format   
 The data format is based on the format used in the [GermEval2014 Named Entity Recognition Shared Task](https://sites.google.com/site/germeval2014ner/data). Text is encoded as one token per line, with name spans encoded in the BIO-scheme, provided as tab-separated values:
@@ -76,9 +73,9 @@ No.	TOKEN	NE-TAG	NE-EMB
 ```
 
 For our purposes we extend this format by adding
-* a fifth column for an ``ID`` for the outer ``NE-TAG`` from an authority file (in this case [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) is used) 
-* column six for use as a variable ``url_id`` (see [Image Support](https://github.com/qurator-spk/neat/blob/master/README.md#28-image-support) for further details)
-* finally, columns 7+ are used for storing ``left,right,top,bottom`` pixel coordinates for facsimile snippets 
+* a fifth column for an ``ID`` for the outer ``NE-TAG`` from an authority file
+* a sixth column for a variable ``url_id`` (see [Image Support](https://github.com/qurator-spk/neat/blob/master/README.md#27-image-support) for further details)
+* columns 7+ for the ``left,right,top,bottom`` pixel coordinates of image snippets
 
 Example (full):
 ```tsv
@@ -105,23 +102,9 @@ No.	TOKEN	NE-TAG	NE-EMB	ID	url_id	left,right,top,bottom
 ```
 
 #### 2.4 Data preparation  
-The source data that is used for annotation are OCR results in [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) format. We provide a [Python tool](https://github.com/qurator-spk/page2tsv) that supports the transformation of [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) OCR files into the [TSV format](https://github.com/qurator-spk/neat/blob/master/README.md#23-data-format) required for use with [neat](https://github.com/qurator-spk/neat).
-
-#### 2.5 Provenance
-The processing pipeline applied at the Berlin State Library comprises the follows steps: 
-
-1. Layout Analysis & Textline Extraction       
-Layout Analysis & Textline Extraction @[sbb_textline_detector](https://github.com/qurator-spk/sbb_textline_detector)
-2. OCR & Word Segmentation    
-OCR is based on [OCR-D](https://github.com/OCR-D)'s [ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr) which requires [Tesseract](https://github.com/tesseract-ocr/tesseract) **>= 4.1.0**. The [GT4HistOCR_2000000](https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/GT4HistOCR_2000000.traineddata) model, which is [trained](https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR) on the [GT4HistOCR](https://zenodo.org/record/1344132) corpus, is used. Further details are available in the [paper](https://arxiv.org/abs/1809.05501).
-3. TSV Transformation   
-A simple [Python tool](https://github.com/qurator-spk/page2tsv) is used for the transformation of the OCR results in [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) to [TSV](https://github.com/qurator-spk/neat/blob/master/docs/README.md#23-data-format).
-4. Tokenization    
-For tokenization, [SoMaJo](https://github.com/tsproisl/SoMaJo) is used.
-5. Named Entity Recognition    
-For Named Entity Recognition, a [BERT-Base](https://github.com/google-research/bert) model was trained for noisy OCR texts with historical spelling variation. [sbb_ner](https://github.com/qurator-spk/sbb_ner) is using a combination of unsupervised training on a large (~2.3m pages) [corpus of German OCR](https://zenodo.org/record/3257041) in combination with supervised training on a small (47k tokens) [annotated corpus](https://github.com/EuropeanaNewspapers/ner-corpora/tree/master/enp_DE.sbb.bio). Further details are available in the [paper](https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/KONVENS2019_paper_4.pdf).
-
-#### 2.6 Keyboard-Navigation
+The source data used for annotation are OCR results in [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) format. We provide a [Python tool](https://github.com/qurator-spk/page2tsv) that transforms [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) OCR files into the [TSV format](https://github.com/qurator-spk/neat/blob/master/README.md#23-data-format) used by [neat](https://github.com/qurator-spk/neat).
+
+#### 2.5 Keyboard-Navigation
 | Key Combination|      Action      |
 |:---------|:-------------------------------------------|
 | Left     |  Move one cell left                        |
@@ -162,7 +145,7 @@ For Named Entity Recognition, a [BERT-Base](https://github.com/google-research/b
 | l r      | remove on display row (minimum is 5)       |
 |----------|--------------------------------------------|
 
-#### 2.7 Mouse-Navigation
+#### 2.6 Mouse-Navigation
 * use mouse wheel to scroll up and down
 
 * left-click `<<` and `>>` to move 15 rows up or down
@@ -179,14 +162,11 @@ For Named Entity Recognition, a [BERT-Base](https://github.com/google-research/b
 
 * left-click the `POSITION` of a row and select `start-sentence` from the drop-down menu to start a new sentence
 
-#### 2.8 Image Support
-Provided facsimile images are available online via the [iiif.io](https://iiif.io/) Image API, [neat](https://github.com/qurator-spk/neat) supports the embedding of facsimile snippets into its interface to help with data annotation and correction. 
-This further requires that OCR with word segmentation is applied to the image to determine bounding boxes for tokens. 
-
-The iiif-image-url contained in the source ``#`` can then be used as a replacement for ``url_id`` in combination with the token bounding boxes as ``left,right,top,bottom`` to obtain the facsimile snippet url and display the image in the leftmost column. Clicking on the facsimile snippet opens up a new tab with a larger context.
+#### 2.7 Image Support
+Provided that facsimile images are available via the [IIIF](https://iiif.io/) Image API, [neat](https://github.com/qurator-spk/neat) can embed image snippets in its interface to assist data annotation and correction. This requires that the PAGE-XML OCR contains word bounding boxes.
 
-#### 2.9 Saving progress
-[neat](https://github.com/qurator-spk/neat) runs fully locally in the browser. Therefore it can not automatically save any changes you made to disk. You have to use the `Save Changes` button in order to so manually from time to time. If your browser automatically saves all downloads to your `Downloads` folder, you might want to configure it so that it instead prompts you where to save.
+#### 2.8 Saving progress
+[neat](https://github.com/qurator-spk/neat) runs fully locally in the browser. Therefore it cannot automatically save your changes to disk. You have to use the `Save Changes` button to do so manually from time to time.
 
 ### 3. Annotation Guidelines
-The most recent version of the [Annotation Guidelines](https://github.com/qurator-spk/neat/blob/master/Annotation_Guidelines.pdf) is included in this repository. 
+[Annotation Guidelines](https://github.com/qurator-spk/neat/blob/master/Annotation_Guidelines.pdf)
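
Reviewer note (not part of the patch): the extended TSV layout described in section 2.3 of the patched README can be read with a few lines of Python. The following is a minimal sketch under stated assumptions, not code from neat or page2tsv. It assumes that lines starting with `#` are source lines to be skipped, that the header row starts with `No.`, and that the coordinates from column 7 onward may arrive either as four separate cells or as one comma-separated cell. `Token`, `parse_rows`, and `iiif_region` are hypothetical names. The region helper relies only on the IIIF Image API convention that a pixel region is written as `x,y,w,h`; how neat itself builds the snippet URL is not shown here.

```python
import csv
import io
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    """One TSV row; field names are illustrative, not taken from neat."""
    position: str
    token: str
    ne_tag: str
    ne_emb: str
    entity_id: str
    url_id: str
    box: Optional[Tuple[int, ...]]  # (left, right, top, bottom) or None


def parse_rows(text: str) -> List[Token]:
    """Parse the extended TSV layout into Token records."""
    tokens = []
    # QUOTE_NONE keeps quotation marks inside tokens as literal characters.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, '#' source lines, and the header row (assumptions).
        if not row or row[0].startswith("#") or row[0] == "No.":
            continue
        row = row + [""] * (6 - len(row))  # pad short rows to six columns
        # Accept coordinates as four cells or as one comma-separated cell.
        coords = [c for cell in row[6:] for c in cell.split(",") if c.strip()]
        box = tuple(int(c) for c in coords) if len(coords) == 4 else None
        tokens.append(Token(*row[:6], box))
    return tokens


def iiif_region(box: Tuple[int, ...]) -> str:
    """Convert (left, right, top, bottom) into an IIIF 'x,y,w,h' region."""
    left, right, top, bottom = box
    return f"{left},{top},{right - left},{bottom - top}"


if __name__ == "__main__":
    # Hypothetical sample data, made up for illustration only.
    sample = (
        "No.\tTOKEN\tNE-TAG\tNE-EMB\tID\turl_id\tleft,right,top,bottom\n"
        "# hypothetical source line\n"
        "1\tBerlin\tB-LOC\tO\t-\t0\t100,180,40,70\n"
    )
    for tok in parse_rows(sample):
        print(tok.token, tok.ne_tag, iiif_region(tok.box) if tok.box else "")
```

Keeping the bounding box optional lets the same parser handle files with and without image support, since rows without coordinates simply yield `box=None`.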