neat: named entity annotation tool


Table of contents

1. Introduction

2. User Guide

   2.1 Technical requirements

   2.2 Installation

   2.3 Data format

   2.4 Data preparation

   2.5 Keyboard navigation

   2.6 Mouse navigation

   2.7 Image support

   2.8 Saving progress

3. Annotation Guidelines

1. Introduction

neat is a simple, browser-based tool for editing and annotating text with named entities to produce a dataset for training/testing/evaluation. It can be used to add or correct named entity BIO-tags in a TSV file and to correct the token text or tokenization (e.g. due to OCR/segmentation errors).

neat is developed at the Berlin State Library for data annotation in the SoNAR-IDH project and the QURATOR project.

2. User Guide

2.1 Technical requirements

neat runs locally as a pure HTML+JavaScript webpage in your web browser. No software needs to be installed, but JavaScript has to be enabled in the browser.

2.2 Installation

Clone the repo using git clone https://github.com/qurator-spk/neat.git or download the ZIP. Make sure you have neat.html and neat.js in the same directory and open neat.html in a browser. Any fairly recent browser should work, but only Chrome and Firefox are tested.

2.3 Data format

The data format is based on the format used in the GermEval 2014 Named Entity Recognition Shared Task. Text is encoded as one token per line, with named entity spans encoded in the BIO scheme, provided as tab-separated values:

  • the first column contains either a # (a comment line giving the source the sentence is cited from), the token position within the sentence (>= 1), or 0, which marks a sentence boundary
  • the second column contains the token text
  • outer entity spans are encoded in the third column, NE-TAG
  • embedded entity spans are encoded in the fourth column, NE-EMB

Example (simple):

No.	TOKEN	NE-TAG	NE-EMB
# https://example.url
1	Donnerstag	O	O
2	,	O	O
3	1	O	O	
4	.	O	O	
5	Januar	O	O	
6	.	O	O		
0		O	O
1	Berliner	B-ORG	B-LOC	
2	Tageblatt	I-ORG	O	
3	.	O	O		
0		O	O
1	Nr	O	O	
2	.	O	O		
3	1	O	O	
4	.	O	O	
0		O	O
1	Seite	O	O
2	3	O	O
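
The simple format can be read with a few lines of code. The following Python sketch is not part of neat; read_sentences and entity_spans are hypothetical helper names. It assumes a header row starting with No., comment rows starting with #, and boundary rows whose first column is 0, as in the example above.

```python
import csv
import io


def read_sentences(tsv_text):
    """Group the rows of a neat-style TSV into sentences.

    Each sentence is a list of (token, ne_tag, ne_emb) tuples.
    """
    sentences, current = [], []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, source comments and the header row.
        if not row or row[0].startswith("#") or row[0] == "No.":
            continue
        # A first column of 0 marks a sentence boundary.
        if row[0] == "0":
            if current:
                sentences.append(current)
            current = []
            continue
        current.append((row[1], row[2], row[3]))
    if current:
        sentences.append(current)
    return sentences


def entity_spans(tags):
    """Return (start, end, type) for each BIO span; end is exclusive."""
    spans, start, etype = [], None, None
    # The trailing "O" is a sentinel that closes a span at the sentence end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        # A stray I- tag without a preceding B- tag is tolerated as a start.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


SAMPLE = "\n".join([
    "No.\tTOKEN\tNE-TAG\tNE-EMB",
    "# https://example.url",
    "1\tJanuar\tO\tO",
    "2\t.\tO\tO",
    "0\t\tO\tO",
    "1\tBerliner\tB-ORG\tB-LOC",
    "2\tTageblatt\tI-ORG\tO",
    "3\t.\tO\tO",
])

if __name__ == "__main__":
    for sentence in read_sentences(SAMPLE):
        tokens = [tok for tok, _, _ in sentence]
        outer = entity_spans(tag for _, tag, _ in sentence)
        print(tokens, outer)
```

Running the script prints the tokens of each sentence together with its outer entity spans. Passing the fourth column instead of the third yields the embedded spans.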

For our purposes we extend this format by adding

  • a fifth column, ID, holding an identifier from an authority file for the entity in the outer NE-TAG
  • a sixth column holding the variable url_id (see 2.7 Image support for further details)
  • columns seven and up, which store the left,right,top,bottom pixel coordinates used for image snippets

Example (full):

No.	TOKEN	NE-TAG	NE-EMB	ID	url_id	left,right,top,bottom
# https://example.url/iiif/left,right,top,bottom/full/0/default.jpg
1	Donnerstag	O	O	-	0	174,352,358,390
2	,	O	O	-	0	174,352,358,390	
3	1	O	O	-	0	367,392,361,381
4	.	O	O	-	0	370,397,352,379
5	Januar	O	O	-	0	406,518,358,386
6	.	O	O	-	0	406,518,358,386	
0
1	Berliner	B-ORG	B-LOC	1086206452	0	816,984,358,388
2	Tageblatt	I-ORG	O	1086206452	0	1005,1208,360,387
3	.	O	O	-	0	1005,1208,360,387
0
1	Nr	O	O	-	0	1237,1288,360,382
2	.	O	O	-	0	1237,1288,360,382
3	1	O	O	-	0	1304,1326,361,381
4	.	O	O	-	0	1304,1326,361,381
0
1	Seite	O	O	-	0	1837,1926,361,392
2	3	O	O	-	0	1939,1967,364,385
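
A row of the extended format could be parsed into a typed record as sketched below. This is an illustration only: the Row class and parse_row are hypothetical names, not part of neat. The example above shows the coordinates as one comma-separated field, while the text speaks of columns 7+, so the sketch accepts either layout; that tolerance is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Row:
    position: int
    token: str
    ne_tag: str
    ne_emb: str
    entity_id: Optional[str]        # None when the ID column holds "-"
    url_id: int
    box: Tuple[int, int, int, int]  # left, right, top, bottom in pixels


def parse_row(line):
    """Parse one token row of the extended TSV format."""
    cols = line.rstrip("\n").split("\t")
    # Accept "l,r,t,b" in one field or four separate fields; empty
    # fields produced by trailing tabs are ignored.
    coords = [int(v) for col in cols[6:] for v in col.split(",") if v]
    if len(coords) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(coords)}")
    return Row(
        position=int(cols[0]),
        token=cols[1],
        ne_tag=cols[2],
        ne_emb=cols[3],
        entity_id=None if cols[4] == "-" else cols[4],
        url_id=int(cols[5]),
        box=(coords[0], coords[1], coords[2], coords[3]),
    )


if __name__ == "__main__":
    row = parse_row("1\tBerliner\tB-ORG\tB-LOC\t1086206452\t0\t816,984,358,388")
    left, right, top, bottom = row.box
    print(row.token, row.entity_id, right - left, bottom - top)
```

Comment rows (#), the header row and sentence boundary rows (0) carry no token data and would need to be filtered out before calling parse_row.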

2.4 Data preparation

The source data used for annotation are OCR results in PAGE-XML format. We provide a Python tool that transforms PAGE-XML OCR files into the TSV format used by neat.
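
To illustrate what such a transformation involves, here is a minimal Python sketch. It is not the tool mentioned above and does not reproduce its behaviour. It assumes that each PAGE-XML Word element has a Coords child with a points attribute and a TextEquiv/Unicode child, it ignores XML namespaces, and it writes all words as a single sentence. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace so the sketch works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def words_with_boxes(page_xml):
    """Yield (text, left, right, top, bottom) for every Word element."""
    root = ET.fromstring(page_xml)
    for elem in root.iter():
        if local(elem.tag) != "Word":
            continue
        points, text = None, ""
        for child in elem:  # direct children only, so glyph-level data is skipped
            name = local(child.tag)
            if name == "Coords":
                points = child.get("points")
            elif name == "TextEquiv":
                for sub in child:
                    if local(sub.tag) == "Unicode":
                        text = sub.text or ""
        if not points:
            continue
        # points is a polygon "x1,y1 x2,y2 ..."; reduce it to a bounding box.
        xs, ys = zip(*(map(int, p.split(",")) for p in points.split()))
        yield text, min(xs), max(xs), min(ys), max(ys)


def to_tsv(page_xml, url_id=0):
    """Render all words as one untagged sentence in the extended TSV layout."""
    lines = ["No.\tTOKEN\tNE-TAG\tNE-EMB\tID\turl_id\tleft,right,top,bottom"]
    for n, (text, left, right, top, bottom) in enumerate(
            words_with_boxes(page_xml), start=1):
        lines.append(f"{n}\t{text}\tO\tO\t-\t{url_id}\t{left},{right},{top},{bottom}")
    return "\n".join(lines)


SAMPLE_PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.tif" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1"><TextLine id="l1">
      <Word id="w1"><Coords points="816,358 984,358 984,388 816,388"/>
        <TextEquiv><Unicode>Berliner</Unicode></TextEquiv></Word>
      <Word id="w2"><Coords points="1005,360 1208,360 1208,387 1005,387"/>
        <TextEquiv><Unicode>Tageblatt</Unicode></TextEquiv></Word>
    </TextLine></TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    print(to_tsv(SAMPLE_PAGE))
```

Sentence splitting and the # source line are left out of the sketch.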

2.5 Keyboard navigation

Key Combination  Action
---------------  --------------------------------------------
Left             Move one cell left
Right            Move one cell right
Up               Move one row up
Down             Move one row down
PageDown         Move page down
PageUp           Move page up
Ctrl+Up          Move entire table one row up
Ctrl+Down        Move entire table one row down
---------------  --------------------------------------------
s t              Start new sentence in current row
m e              Merge current row with row above
s p              Create copy of current row
d l              Delete current row
---------------  --------------------------------------------
backspace        Set NE-TAG / NE-EMB to "O"
b p              Set NE-TAG / NE-EMB to "B-PER"
b l              Set NE-TAG / NE-EMB to "B-LOC"
b o              Set NE-TAG / NE-EMB to "B-ORG"
b w              Set NE-TAG / NE-EMB to "B-WORK"
b c              Set NE-TAG / NE-EMB to "B-CONF"
b e              Set NE-TAG / NE-EMB to "B-EVT"
b t              Set NE-TAG / NE-EMB to "B-TODO"
i p              Set NE-TAG / NE-EMB to "I-PER"
i l              Set NE-TAG / NE-EMB to "I-LOC"
i o              Set NE-TAG / NE-EMB to "I-ORG"
i w              Set NE-TAG / NE-EMB to "I-WORK"
i c              Set NE-TAG / NE-EMB to "I-CONF"
i e              Set NE-TAG / NE-EMB to "I-EVT"
i t              Set NE-TAG / NE-EMB to "I-TODO"
---------------  --------------------------------------------
enter            Edit TOKEN or GND-ID
esc              Close TOKEN or GND-ID edit field without applying changes
---------------  --------------------------------------------
l a              Add one display row
l r              Remove one display row (minimum is 5)
---------------  --------------------------------------------
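
The tag shortcuts follow a regular pattern: the first key selects the B- or I- prefix and the second key selects the entity type. The Python sketch below only illustrates that pattern. It is not how neat.js handles key events, and tag_for_keys is a hypothetical name.

```python
# First key: BIO prefix. Second key: entity type.
PREFIX_KEYS = {"b": "B", "i": "I"}
TYPE_KEYS = {
    "p": "PER", "l": "LOC", "o": "ORG", "w": "WORK",
    "c": "CONF", "e": "EVT", "t": "TODO",
}


def tag_for_keys(first, second):
    """Return the tag for a two-key sequence, or None if it is not a tag shortcut."""
    if first in PREFIX_KEYS and second in TYPE_KEYS:
        return f"{PREFIX_KEYS[first]}-{TYPE_KEYS[second]}"
    return None


if __name__ == "__main__":
    print(tag_for_keys("b", "p"))
    print(tag_for_keys("i", "w"))
```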

2.6 Mouse navigation

  • use mouse wheel to scroll up and down

  • left-click << and >> to move 15 rows up or down

  • left-click O in the NE-TAG or NE-EMB columns to open the drop-down menu and select any of the supported NE-Tags to tag a token or change an existing tag to another one

  • left-click a tag in the NE-TAG or NE-EMB columns and subsequently select O to remove a wrong tag

  • left-click a token in the TOKEN column to edit/correct the text content

  • left-click the POSITION of a row and select split from the drop-down menu to create a copy of the current row

  • left-click the POSITION of a row and select merge from the drop-down menu to merge the current row with the row above

  • left-click the POSITION of a row and select start-sentence from the drop-down menu to start a new sentence

2.7 Image support

Provided that facsimile images are available via the IIIF Image API (iiif.io), neat can embed image snippets into its interface to assist data annotation and correction. This requires the PAGE-XML OCR to contain word bounding boxes.
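
As an illustration of how a word bounding box maps to an image request, the Python sketch below builds an IIIF Image API URL from left,right,top,bottom coordinates. The IIIF region parameter takes the form x,y,w,h, so width and height are derived from the box. How neat itself fills in the URL template is not described here; iiif_snippet_url and the base URL are assumptions made for this example.

```python
def iiif_snippet_url(base, left, right, top, bottom):
    """Build an IIIF Image API URL for the region covering one token.

    base is the image's IIIF base URL (server, prefix and identifier).
    """
    if right <= left or bottom <= top:
        raise ValueError("empty bounding box")
    # IIIF regions are given as x,y,w,h in pixels.
    region = f"{left},{top},{right - left},{bottom - top}"
    # full size, no rotation, default quality, JPEG: same suffix as the
    # example URL in section 2.3.
    return f"{base.rstrip('/')}/{region}/full/0/default.jpg"


if __name__ == "__main__":
    print(iiif_snippet_url("https://example.url/iiif", 816, 984, 358, 388))
```

For the token Berliner from the full example in section 2.3, this yields a region that is 168 pixels wide and 30 pixels high.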

2.8 Saving progress

neat runs entirely locally in the browser, so it cannot automatically save your changes to disk. Use the Save Changes button to save manually from time to time.

3. Annotation Guidelines

The annotation guidelines are provided as Annotation_Guidelines.pdf in the repository.