mirror of https://github.com/qurator-spk/neat.git synced 2025-08-17 13:19:53 +02:00

neat: named entity annotation tool

version 0.1

1. Introduction

2. User Guide

2.1 Technical requirements

2.2 Installation

2.3 Data format

2.4 Data preparation

2.5 Provenance

2.6 Keyboard navigation

2.7 Mouse navigation

2.8 Image support

2.9 Saving progress

3. Annotation Guidelines

1. Introduction

neat is a simple, browser-based tool for editing and annotating text with named entities to produce a corpus for training/testing/evaluation. It can be used to add or correct named entity BIO-tags in a TSV file and to correct the token text or tokenization (e.g. due to OCR/segmentation errors).

neat is developed at the Berlin State Library for data annotation in the context of the SoNAR-IDH project and the QURATOR project.

2. User Guide

2.1 Technical Requirements

neat runs locally as a pure HTML+JavaScript webpage in your web browser. No software needs to be installed, but JavaScript has to be enabled in the browser.

2.2 Installation

Clone the repo using git clone https://github.com/qurator-spk/neat.git or download the ZIP. Make sure that at least neat.html and neat.js reside in the same local directory, then open neat.html in a browser. Any fairly recent browser should work, but only Chrome and Firefox are tested.

2.3 Data format

The data format is based on the format used in the GermEval2014 Named Entity Recognition Shared Task. Text is encoded as one token per line, with name spans encoded in the BIO-scheme, provided as tab-separated values:

the first column contains either a #, which marks a line giving the source the sentence is cited from, or
the token position within the sentence (>= 1), or
a 0, which indicates a sentence boundary
the second column contains the token text
outer entity spans are encoded in the third column NE-TAG
embedded entity spans are encoded in the fourth column NE-EMB

Example (simple):

No.	TOKEN	NE-TAG	NE-EMB
# https://example.url
1	Donnerstag	O	O
2	,	O	O
3	1	O	O	
4	.	O	O	
5	Januar	O	O	
6	.	O	O		
0		O	O
1	Berliner	B-ORG	B-LOC	
2	Tageblatt	I-ORG	O	
3	.	O	O		
0		O	O
1	Nr	O	O	
2	.	O	O		
3	1	O	O	
4	.	O	O	
0		O	O
1	Seite	O	O
2	3	O	O
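
The simple format above can be read with a few lines of Python. The following is a minimal sketch, not part of neat: the function name parse_neat_tsv and the shortened sample are illustrative assumptions. It skips the header and the # source line and starts a new sentence at every row whose first column is 0.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (token text, NE-TAG, NE-EMB)


def parse_neat_tsv(text: str) -> List[List[Token]]:
    """Split TSV text in the simple four-column format into sentences."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or source line
        cols = line.split("\t")
        if cols[0] == "No.":
            continue  # header row
        if cols[0] == "0":
            # sentence boundary: close the sentence collected so far
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((cols[1], cols[2], cols[3]))
    if current:
        sentences.append(current)
    return sentences


# Shortened version of the example above (hypothetical sample data).
SAMPLE = "\n".join([
    "No.\tTOKEN\tNE-TAG\tNE-EMB",
    "# https://example.url",
    "1\tDonnerstag\tO\tO",
    "2\t,\tO\tO",
    "0\t\tO\tO",
    "1\tBerliner\tB-ORG\tB-LOC",
    "2\tTageblatt\tI-ORG\tO",
    "3\t.\tO\tO",
])

for sentence in parse_neat_tsv(SAMPLE):
    print(" ".join(token for token, _, _ in sentence))
```

Trailing tabs, as found on some rows of the example, only add empty columns at the end and do not affect the three columns that are read.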

For our purposes we extend this format by adding

a fifth column for an ID for the outer NE-TAG from an authority file (in this case Wikidata is used)
column six for use as a variable url_id (see Image Support for further details)
finally, columns 7+ are used for storing left,right,top,bottom pixel coordinates for facsimile snippets

Example (full):

No.	TOKEN	NE-TAG	NE-EMB	ID	url_id	left,right,top,bottom
# https://example.url/iiif/left,right,top,bottom/full/0/default.jpg
1	Donnerstag	O	O	-	0	174,352,358,390
2	,	O	O	-	0	174,352,358,390	
3	1	O	O	-	0	367,392,361,381
4	.	O	O	-	0	370,397,352,379
5	Januar	O	O	-	0	406,518,358,386
6	.	O	O	-	0	406,518,358,386	
0
1	Berliner	B-ORG	B-LOC	1086206452	0	816,984,358,388
2	Tageblatt	I-ORG	O	1086206452	0	1005,1208,360,387
3	.	O	O	-	0	1005,1208,360,387
0
1	Nr	O	O	-	0	1237,1288,360,382
2	.	O	O	-	0	1237,1288,360,382
3	1	O	O	-	0	1304,1326,361,381
4	.	O	O	-	0	1304,1326,361,381
0
1	Seite	O	O	-	0	1837,1926,361,392
2	3	O	O	-	0	1939,1967,364,385
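
As an illustration of how the extended columns might be consumed, the sketch below parses token rows of the full format into records and groups the outer NE-TAG spans into entities together with their ID. The names Row, parse_row and outer_entities are hypothetical and not part of neat. It assumes the coordinates appear either comma-separated in column 7 or spread over columns 7 to 10, and it handles token rows only (no header, source or 0 rows).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Row:
    position: int
    token: str
    ne_tag: str
    ne_emb: str
    entity_id: str
    url_id: int
    box: Tuple[int, int, int, int]  # left, right, top, bottom


def parse_row(line: str) -> Row:
    """Parse one token row of the full format."""
    cols = line.rstrip("\t").split("\t")
    # Accept "l,r,t,b" in one column as well as four separate columns.
    left, right, top, bottom = (int(v) for v in ",".join(cols[6:]).split(","))
    return Row(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
               int(cols[5]), (left, right, top, bottom))


def outer_entities(rows: List[Row]) -> List[Tuple[str, str, str]]:
    """Group B-/I- tags of the NE-TAG column into (type, text, ID) spans."""
    spans = []
    current = None  # [type, tokens, id] of the span being built
    for row in rows:
        if row.ne_tag.startswith("B-"):
            if current:
                spans.append(current)
            current = [row.ne_tag[2:], [row.token], row.entity_id]
        elif (row.ne_tag.startswith("I-") and current
              and current[0] == row.ne_tag[2:]):
            current[1].append(row.token)
        else:
            # "O" or an I- tag without a matching open span ends the span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(kind, " ".join(tokens), eid) for kind, tokens, eid in spans]


LINES = [
    "1\tBerliner\tB-ORG\tB-LOC\t1086206452\t0\t816,984,358,388",
    "2\tTageblatt\tI-ORG\tO\t1086206452\t0\t1005,1208,360,387",
    "3\t.\tO\tO\t-\t0\t1005,1208,360,387",
]
rows = [parse_row(line) for line in LINES]
print(outer_entities(rows))
```

The same grouping could be applied to the NE-EMB column to obtain the embedded spans.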

2.4 Data preparation

The source data used for annotation are OCR results in PAGE-XML format. We provide a Python tool that transforms PAGE-XML OCR files into the TSV format required by neat.

2.5 Provenance

The processing pipeline applied at the Berlin State Library comprises the following steps:

Layout Analysis & Textline Extraction
Layout analysis and textline extraction are performed with sbb_textline_detector.
OCR & Word Segmentation
OCR is based on OCR-D's ocrd_tesserocr which requires Tesseract >= 4.1.0. The GT4HistOCR_2000000 model, which is trained on the GT4HistOCR corpus, is used. Further details are available in the paper.
TSV Transformation
A simple Python tool is used for the transformation of the OCR results in PAGE-XML to TSV.
Tokenization
For tokenization, SoMaJo is used.
Named Entity Recognition
For Named Entity Recognition, a BERT-Base model was trained for noisy OCR texts with historical spelling variation. sbb_ner combines unsupervised training on a large (~2.3m pages) corpus of German OCR with supervised training on a small (47k tokens) annotated corpus. Further details are available in the paper.

2.6 Keyboard navigation

Key Combination	Action
Left	Move one cell left
Right	Move one cell right
Up	Move one row up
Down	Move one row down
PageDown	Move page down
PageUp	Move page up
Ctrl+Up	Move entire table one row up
Ctrl+Down	Move entire table one row down
----------	--------------------------------------------
s t	Start new sentence in current row
m e	Merge current row with row above
s p	Create copy of current row
d l	Delete current row
----------	--------------------------------------------
backspace	Set NE-TAG / NE-EMB to "O"
b p	Set NE-TAG / NE-EMB to "B-PER"
b l	Set NE-TAG / NE-EMB to "B-LOC"
b o	Set NE-TAG / NE-EMB to "B-ORG"
b w	Set NE-TAG / NE-EMB to "B-WORK"
b c	Set NE-TAG / NE-EMB to "B-CONF"
b e	Set NE-TAG / NE-EMB to "B-EVT"
b t	Set NE-TAG / NE-EMB to "B-TODO"
i p	Set NE-TAG / NE-EMB to "I-PER"
i l	Set NE-TAG / NE-EMB to "I-LOC"
i o	Set NE-TAG / NE-EMB to "I-ORG"
i w	Set NE-TAG / NE-EMB to "I-WORK"
i c	Set NE-TAG / NE-EMB to "I-CONF"
i e	Set NE-TAG / NE-EMB to "I-EVT"
i t	Set NE-TAG / NE-EMB to "I-TODO"
----------	--------------------------------------------
enter	Edit TOKEN or ID
esc	Close TOKEN or ID edit field without applying changes
----------	--------------------------------------------
l a	Add one display row
l r	Remove one display row (minimum is 5)
----------	--------------------------------------------

2.7 Mouse navigation

use mouse wheel to scroll up and down
left-click << and >> to move 15 rows up or down
left-click O in the NE-TAG or NE-EMB columns to open the drop-down menu and select any of the supported NE-Tags to tag a token or change an existing tag to another one
left-click a tag in the NE-TAG or NE-EMB columns and subsequently select O to remove a wrong tag
left-click a token in the TOKEN column to edit/correct the text content
left-click the POSITION of a row and select split from the drop-down menu to create a copy of the current row
left-click the POSITION of a row and select merge from the drop-down menu to merge the current row with the row above
left-click the POSITION of a row and select start-sentence from the drop-down menu to start a new sentence

2.8 Image Support

Provided that facsimile images are available online via the IIIF (iiif.io) Image API, neat supports embedding facsimile snippets into its interface to help with data annotation and correction. This further requires that OCR with word segmentation has been applied to the image to determine bounding boxes for tokens.

The iiif-image-url contained in the source # can then be used as a replacement for url_id in combination with the token bounding boxes as left,right,top,bottom to obtain the facsimile snippet url and display the image in the leftmost column. Clicking on the facsimile snippet opens up a new tab with a larger context.
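
The construction of a snippet URL can be sketched as follows. In the IIIF Image API, the region segment of a request URL has the form x,y,w,h. The sketch assumes that the literal placeholder left,right,top,bottom in the source URL is replaced by a region computed from the token bounding box. How neat.js performs the substitution internally is not described here, so the function snippet_url is a hypothetical illustration and not neat's own code.

```python
def snippet_url(template: str, left: int, right: int, top: int, bottom: int) -> str:
    """Fill the placeholder of a source URL with an IIIF region (x,y,w,h).

    Assumption: the template contains the literal text "left,right,top,bottom"
    where the IIIF region belongs.
    """
    region = f"{left},{top},{right - left},{bottom - top}"
    return template.replace("left,right,top,bottom", region)


# Source line and bounding box of the token "Donnerstag" from the full example.
TEMPLATE = "https://example.url/iiif/left,right,top,bottom/full/0/default.jpg"
print(snippet_url(TEMPLATE, 174, 352, 358, 390))
# https://example.url/iiif/174,358,178,32/full/0/default.jpg
```

A larger context, such as the one opened when clicking on a snippet, could be requested in the same way by widening the box before computing the region.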

2.9 Saving progress

neat runs fully locally in the browser. Therefore it cannot automatically save any changes you made to disk. You have to use the Save Changes button to do so manually from time to time. If your browser automatically saves all downloads to your Downloads folder, you might want to configure it so that it instead prompts you where to save.

3. Annotation Guidelines

The most recent version of the Annotation Guidelines is included in this repository.
