
# User Guide
#### version 0.1

### Table of contents
[1. Introduction](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#1-introduction) 

[2. User Guide](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#2-user-guide)   

[3. Annotation Guidelines](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#3-annotation-guidelines)   

### 1. Introduction
[neath](https://github.com/qurator-spk/neath) is a simple, browser-based tool for editing and annotating text with named entities to produce a corpus for training/testing/evaluation. It can be used to add or correct named entity BIO-tags in a TSV file and to correct the token text or tokenization (e.g. due to OCR/segmentation errors). 

[neath](https://github.com/qurator-spk/neath) is developed at the [Berlin State Library](http://staatsbibliothek-berlin.de/) for data annotation in the context of the [SoNAR-IDH](https://sonar.fh-potsdam.de/) project and the [QURATOR](https://qurator.ai/) project.

### 2. User Guide

#### Technical Requirements 
[neath](https://github.com/qurator-spk/neath) runs locally as a pure HTML+JavaScript webpage in your web browser. No software needs to be installed, but JavaScript has to be enabled in the browser. Any fairly recent browser should work, but only Chrome and Firefox are tested.

#### Data format   
The data format is based on the format used in the [GermEval2014 Named Entity Recognition Shared Task](https://sites.google.com/site/germeval2014ner/data). Text is encoded as one token per line, with name spans encoded in the BIO-scheme, provided as tab-separated values:
* the first column contains one of the following:
  * a `#`, which marks a source line naming the source the following sentences are cited from
  * the token position within the sentence (``>=1``)
  * ``0``, which indicates a sentence boundary
* the second column contains the token ``text`` 
* outer entity spans are encoded in the third column ``NE-TAG``
* embedded entity spans are encoded in the fourth column ``NE-EMB`` 

Example (simple):
```tsv
No.	TOKEN	NE-TAG	NE-EMB
# https://example.url
1	Donnerstag	O	O
2	,	O	O
3	1	O	O	
4	.	O	O	
5	Januar	O	O	
6	.	O	O		
0		O	O
1	Berliner	B-ORG	B-LOC	
2	Tageblatt	I-ORG	O	
3	.	O	O		
0		O	O
1	Nr	O	O	
2	.	O	O		
3	1	O	O	
4	.	O	O	
0		O	O
1	Seite	O	O
2	3	O	O
```
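
The simple format can be read with a few lines of Python. The following is a minimal sketch, not part of neath itself: the function names `read_sentences` and `entity_spans` are hypothetical, and it assumes (as in the example above) that rows with position ``0`` carry no token text and that every token row has at least four columns.

```python
import csv
import io


def read_sentences(tsv_text):
    """Return a list of sentences; each sentence is a list of
    (token, ne_tag, ne_emb) tuples."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)  # tokens may contain quotes
    next(reader, None)  # skip the header row
    sentences, current = [], []
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # blank line or source line
        if row[0] == "0":  # sentence boundary (assumed to carry no token)
            if current:
                sentences.append(current)
            current = []
            continue
        current.append((row[1], row[2], row[3]))
    if current:
        sentences.append(current)
    return sentences


def entity_spans(sentence):
    """Decode the outer NE-TAG column into (start, end, label) spans,
    with `end` exclusive. A stray I- tag is treated as a span start."""
    spans, start, label = [], None, None
    for i, (_, tag, _) in enumerate(sentence):
        if tag.startswith("I-") and tag[2:] == label:
            continue  # current span goes on
        if label is not None:  # any other tag closes the open span
            spans.append((start, i, label))
            start, label = None, None
        if tag[:2] in ("B-", "I-"):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(sentence), label))
    return spans


SAMPLE = "\n".join([
    "No.\tTOKEN\tNE-TAG\tNE-EMB",
    "# https://example.url",
    "1\tJanuar\tO\tO",
    "2\t.\tO\tO",
    "0\t\tO\tO",
    "1\tBerliner\tB-ORG\tB-LOC",
    "2\tTageblatt\tI-ORG\tO",
    "3\t.\tO\tO",
])

for sentence in read_sentences(SAMPLE):
    print([tok for tok, _, _ in sentence], entity_spans(sentence))
```

For the second sentence this yields the single outer span `(0, 2, 'ORG')` covering "Berliner Tageblatt". Embedded spans could be decoded the same way from the ``NE-EMB`` column.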

For our purposes we extend this format by adding
* a fifth column for an ``ID`` for the outer ``NE-TAG`` from an authority file (in this case, the [GND](https://www.dnb.de/EN/Professionell/Standardisierung/GND/gnd_node.html) is used) 
* column six for use as a variable ``url_id`` (see [Image Support](https://github.com/qurator-spk/neath/blob/master/User_Guide.md#image-support) for further details)
* finally, columns 7+ are used for storing ``left,right,top,bottom`` pixel coordinates for facsimile snippets 

Example (full):
```tsv
No.	TOKEN	NE-TAG	NE-EMB	GND-ID	url_id	left,right,top,bottom
# https://example.url/iiif/left,right,top,bottom/full/0/default.jpg
1	Donnerstag	O	O	-	0	174,352,358,390
2	,	O	O	-	0	174,352,358,390	
3	1	O	O	-	0	367,392,361,381
4	.	O	O	-	0	370,397,352,379
5	Januar	O	O	-	0	406,518,358,386
6	.	O	O	-	0	406,518,358,386	
0
1	Berliner	B-ORG	B-LOC	1086206452	0	816,984,358,388
2	Tageblatt	I-ORG	O	1086206452	0	1005,1208,360,387
3	.	O	O	-	0	1005,1208,360,387
0
1	Nr	O	O	-	0	1237,1288,360,382
2	.	O	O	-	0	1237,1288,360,382
3	1	O	O	-	0	1304,1326,361,381
4	.	O	O	-	0	1304,1326,361,381
0
1	Seite	O	O	-	0	1837,1926,361,392
2	3	O	O	-	0	1939,1967,364,385
```
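
A row of the extended format can be mapped onto a small record type. This is only a sketch under stated assumptions: the `TokenRow` class and `parse_full_row` function are hypothetical names, ``-`` is taken to mean "no authority ID", and the coordinates are accepted either comma-separated in one column (as in the example above) or spread over separate tab-separated columns.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TokenRow:
    position: int
    token: str
    ne_tag: str
    ne_emb: str
    gnd_id: Optional[str]            # None when the column holds "-"
    url_id: int
    box: Tuple[int, int, int, int]   # left, right, top, bottom in pixels


def parse_full_row(line):
    """Parse one token row of the extended format.

    Returns None for header, source (#) and sentence boundary lines,
    and for rows that do not carry exactly four coordinates.
    """
    fields = line.rstrip("\n").split("\t")
    if not fields[0].isdigit() or len(fields) < 7:
        return None
    # Accept "l,r,t,b" in one column as well as four separate columns;
    # empty fields caused by trailing tabs are ignored.
    coords = [int(v) for f in fields[6:] for v in f.split(",") if v.strip()]
    if len(coords) != 4:
        return None
    return TokenRow(
        position=int(fields[0]),
        token=fields[1],
        ne_tag=fields[2],
        ne_emb=fields[3],
        gnd_id=None if fields[4] == "-" else fields[4],
        url_id=int(fields[5]),
        box=(coords[0], coords[1], coords[2], coords[3]),
    )


row = parse_full_row("1\tBerliner\tB-ORG\tB-LOC\t1086206452\t0\t816,984,358,388")
print(row)
```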

#### Data preparation  
The source data used for annotation are OCR results in PAGE-XML format. We provide a [Python tool](https://github.com/qurator-spk/page2tsv) that helps with the transformation of PAGE-XML files into the TSV format required by [neath](https://github.com/qurator-spk/neath).

#### Provenance
The processing pipeline applied at the Berlin State Library comprises the following steps: 

1. Layout Analysis & Textline Extraction       
Layout analysis and textline extraction are performed with [sbb_textline_detector](https://github.com/qurator-spk/sbb_textline_detector).
2. OCR & Word Segmentation    
OCR is based on [OCR-D](https://github.com/OCR-D)'s [ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr) which requires [Tesseract](https://github.com/tesseract-ocr/tesseract) **>= 4.1.0**. The [GT4HistOCR_2000000](https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/GT4HistOCR_2000000.traineddata) model, which is [trained](https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR) on the [GT4HistOCR](https://zenodo.org/record/1344132) corpus, is used. Further details are available in the [paper](https://arxiv.org/abs/1809.05501).
3. TSV Transformation   
A simple [Python tool](https://github.com/qurator-spk/page2tsv) is used for the transformation of the OCR results in [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) to [TSV](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#data-format).
4. Tokenization    
For tokenization, [SoMaJo](https://github.com/tsproisl/SoMaJo) is used.
5. Named Entity Recognition    
For Named Entity Recognition, a [BERT-Base](https://github.com/google-research/bert) model was trained for noisy OCR texts with historical spelling variation. [sbb_ner](https://github.com/qurator-spk/sbb_ner) combines unsupervised training on a large (~2.3m pages) [corpus of German OCR](https://zenodo.org/record/3257041) with supervised training on a small (47k tokens) [annotated corpus](https://github.com/EuropeanaNewspapers/ner-corpora/tree/master/enp_DE.sbb.bio). Further details are available in the [paper](https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/KONVENS2019_paper_4.pdf).

#### Keyboard-Navigation

| Key Combination|      Action      |
|:---------|:-------------------------------------------|
| Left     |  Move one cell left                        |
| Right    |  Move one cell right                       |
| Up       |  Move one row up                           |
| Down     |  Move one row down                         |
| PageDown |  Move page down                            |
| PageUp   |  Move page up                              |
| Ctrl+Up  |  Move entire table one row up              |
| Ctrl+Down|  Move entire table one row down            |
|----------|--------------------------------------------|
| s  t     |  Start new sentence in current row         |
| m  e     |  Merge current row with row above          |
| s  p     |  Create copy of current row                |
| d  l     |  Delete current row                        |
|----------|--------------------------------------------|
| backspace|  Set NE-TAG / NE-EMB to "O"                |
| b  p     |  Set NE-TAG / NE-EMB to "B-PER"            |
| b  l     |  Set NE-TAG / NE-EMB to "B-LOC"            |
| b  o     |  Set NE-TAG / NE-EMB to "B-ORG"            |
| b  w     |  Set NE-TAG / NE-EMB to "B-WORK"           |
| b  c     |  Set NE-TAG / NE-EMB to "B-CONF"           |
| b  e     |  Set NE-TAG / NE-EMB to "B-EVT"            |
| b  t     |  Set NE-TAG / NE-EMB to "B-TODO"           |
| i  p     |  Set NE-TAG / NE-EMB to "I-PER"            |
| i  l     |  Set NE-TAG / NE-EMB to "I-LOC"            |
| i  o     |  Set NE-TAG / NE-EMB to "I-ORG"            |
| i  w     |  Set NE-TAG / NE-EMB to "I-WORK"           |
| i  c     |  Set NE-TAG / NE-EMB to "I-CONF"           | 
| i  e     |  Set NE-TAG / NE-EMB to "I-EVT"            |
| i  t     |  Set NE-TAG / NE-EMB to "I-TODO"           |
|----------|--------------------------------------------|
| enter    | Edit TOKEN or GND-ID                       |
| esc      | Close TOKEN or GND-ID edit field without applying changes |
|----------|--------------------------------------------|
| l a      | add one display row                        |
| l r      | remove one display row (minimum is 5)      |
|----------|--------------------------------------------|

#### Mouse-Navigation
* use mouse wheel to scroll up and down
* use navigation `<<` and `>>` to move faster

#### Image Support
Provided that facsimile images are available online via the [iiif.io](https://iiif.io/) Image API, [neath](https://github.com/qurator-spk/neath) supports embedding facsimile snippets into its interface to help with data annotation and correction. 
This further requires that OCR with word segmentation has been applied to the image to determine bounding boxes for the tokens. 

The IIIF image URL contained in the source line (``#``) can then be used as a replacement for ``url_id``, in combination with the token bounding box given as ``left,right,top,bottom``, to obtain the facsimile snippet URL and display the image in the leftmost column. Clicking on the facsimile snippet opens a new tab showing a larger context.
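
How a snippet URL might be derived can be sketched as follows. This is an illustration only: the function `snippet_url` and its `margin` parameter are hypothetical, and how neath itself fills in the placeholder is an assumption. The sketch relies on the IIIF Image API convention that the region segment of an image URL is given as `x,y,w,h`, so the ``left,right,top,bottom`` values are converted to an origin plus width and height.

```python
def snippet_url(template, box, margin=0,
                placeholder="left,right,top,bottom"):
    """Fill the placeholder in a source URL with an IIIF region.

    `box` is (left, right, top, bottom) in pixels. `margin` widens the
    region on every side, clamped at the top-left image corner; the
    image size is unknown here, so the far edges are not clamped.
    """
    left, right, top, bottom = box
    x = max(0, left - margin)
    y = max(0, top - margin)
    width = (right + margin) - x
    height = (bottom + margin) - y
    return template.replace(placeholder, f"{x},{y},{width},{height}")


source = "https://example.url/iiif/left,right,top,bottom/full/0/default.jpg"
box = (816, 984, 358, 388)  # "Berliner" in the full example

print(snippet_url(source, box))
# https://example.url/iiif/816,358,168,30/full/0/default.jpg
print(snippet_url(source, box, margin=100))
# https://example.url/iiif/716,258,368,230/full/0/default.jpg
```

The second call shows one way a "larger context" view could be requested: the same source URL with a widened region.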

#### Tagging
* adding a tag
* removing a tag
* changing a tag

#### Text correction
* editing the token text

#### Tokenization correction
* merging two tokens
* splitting a token
* sentence boundaries
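
The effect of these corrections on the table can be sketched in Python. This is not neath's JavaScript implementation: the row layout (dicts with `pos`, `token` and `tag`) and the function names are hypothetical, and the sketch assumes that a sentence boundary is a separate row with position ``0``, as in the data format examples. Token positions are recomputed after every operation.

```python
def renumber(rows):
    """Return copies of the rows with positions recounted from 1
    after every sentence boundary row (pos == 0)."""
    out, counter = [], 0
    for row in rows:
        if row["pos"] == 0:
            counter = 0
            out.append(dict(row))
        else:
            counter += 1
            out.append(dict(row, pos=counter))
    return out


def merge_with_above(rows, i):
    """Merge token row i into the row above it, keeping the upper
    row's tag and concatenating the token text."""
    if i <= 0 or rows[i]["pos"] == 0 or rows[i - 1]["pos"] == 0:
        raise ValueError("can only merge two adjacent token rows")
    upper, lower = rows[i - 1], rows[i]
    merged = dict(upper, token=upper["token"] + lower["token"])
    return renumber(rows[:i - 1] + [merged] + rows[i + 1:])


def start_sentence(rows, i):
    """Insert a sentence boundary row in front of row i."""
    boundary = {"pos": 0, "token": "", "tag": "O"}
    return renumber(rows[:i] + [boundary] + rows[i:])


rows = [
    {"pos": 1, "token": "Tage", "tag": "B-ORG"},
    {"pos": 2, "token": "blatt", "tag": "I-ORG"},  # OCR split one word
    {"pos": 3, "token": ".", "tag": "O"},
    {"pos": 4, "token": "Nr", "tag": "O"},
]

rows = merge_with_above(rows, 1)   # "Tage" + "blatt" -> "Tageblatt"
rows = start_sentence(rows, 2)     # new sentence starts at "Nr"
for row in rows:
    print(row["pos"], row["token"], row["tag"])
```

Splitting a token would be the inverse of merging: duplicate the row, then edit the token text of both copies.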

#### Saving progress
[neath](https://github.com/qurator-spk/neath) runs fully locally in the browser. Therefore it cannot automatically save your changes to disk. You have to use the `Save Changes` button to do so manually from time to time. If your browser automatically saves all downloads to your `Downloads` folder, you might want to configure it so that it prompts you where to save instead.

### 3. Annotation Guidelines
The most recent version of the [Annotation Guidelines](https://github.com/qurator-spk/neath/blob/master/Annotation_Guidelines.pdf) is included in this repository.