[neat](https://github.com/qurator-spk/neat) is a simple, browser-based tool for editing and annotating text with named entities to produce a dataset for training/testing/evaluation. It can be used to add or correct named entity BIO-tags in a TSV file and to correct the token text or tokenization (e.g. due to OCR/segmentation errors).
[neat](https://github.com/qurator-spk/neat) is developed at the [Berlin State Library](https://staatsbibliothek-berlin.de/) for data annotation in the [SoNAR-IDH](https://sonar.fh-potsdam.de/) project and the [QURATOR](https://qurator.ai/) project.
[neat](https://github.com/qurator-spk/neat) runs locally as a pure HTML+JavaScript webpage in your web browser. No software needs to be installed, but JavaScript has to be enabled in the browser.
Clone the repo using ``git clone https://github.com/qurator-spk/neat.git`` or download the [ZIP](https://github.com/qurator-spk/neat/archive/master.zip). Make sure you have ``neat.html`` and ``neat.js`` in the same directory and open ``neat.html`` in a browser. Any fairly recent browser should work, but only Chrome and Firefox are tested.
The data format is based on the format used in the [GermEval2014 Named Entity Recognition Shared Task](https://sites.google.com/site/germeval2014ner/data). Text is encoded as one token per line, with name spans encoded in the BIO-scheme, provided as tab-separated values:
* the first column contains either a `#`, which signals the source the sentence is cited from, or
* the token position within the sentence ``>=1``
* sentence boundaries are indicated by ``0``
* the second column contains the token ``text``
* outer entity spans are encoded in the third column ``NE-TAG``
* embedded entity spans are encoded in the fourth column ``NE-EMB``
* a fifth column for an ``ID`` for the outer ``NE-TAG`` from an authority file
* column six for use as a variable ``url_id`` (see [Image Support](https://github.com/qurator-spk/neat/blob/master/README.md#27-image-support) for further details)
* finally, columns 7+ are used for storing ``left,right,top,bottom`` pixel coordinates for image snippets
The source data that is used for annotation are OCR results in [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) format. We provide a [Python tool](https://github.com/qurator-spk/page2tsv) for the transformation of [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) OCR files into the [TSV format](https://github.com/qurator-spk/neat/blob/master/README.md#23-data-format) used by [neat](https://github.com/qurator-spk/neat).
* left-click `<<` and `>>` to move 15 rows up or down
* left-click `O` in the `NE-TAG` or `NE-EMB` columns to open the drop-down menu and select any of the supported NE-Tags to tag a token or change an existing tag to another one
* left-click a tag in the `NE-TAG` or `NE-EMB` columns and subsequently select `O` to remove a wrong tag
* left-click a token in the `TOKEN` column to edit/correct the text content
* left-click the `POSITION` of a row and select `split` from the drop-down menu to create a copy of the current row
* left-click the `POSITION` of a row and select `merge` from the drop-down menu to merge the current row with the row above
* left-click the `POSITION` of a row and select `start-sentence` from the drop-down menu to start a new sentence
Provided facsimile images are available via the [iiif.io](https://iiif.io/) Image API, [neat](https://github.com/qurator-spk/neat) supports the embedding of image snippets into its interface to assist data annotation and correction. This requires that the PAGE-XML OCR contains word bounding boxes.
[neat](https://github.com/qurator-spk/neat) runs fully locally in the browser. Therefore it can not automatically save any changes you made to disk. You have to use the `Save Changes` button to do so manually from time to time.