# User Guide
#### version 0.1
### 1. Introduction
[neath](https://github.com/qurator-spk/neath) is a simple, browser-based tool for editing and annotating text with named entities to produce a corpus for training/testing/evaluation. It can be used to add or correct named entity BIO-tags in a TSV file and to correct the token text or segmentation (e.g. due to OCR errors).
[neath](https://github.com/qurator-spk/neath) is developed at the [Berlin State Library](http://staatsbibliothek-berlin.de/) for data annotation in the context of the [SoNAR-IDH](https://sonar.fh-potsdam.de/) project and the [QURATOR](https://qurator.ai/) project.
### 2. User Guide
#### Technical Requirements
[neath](https://github.com/qurator-spk/neath) runs locally as a pure HTML+JavaScript webpage in your web browser. No software needs to be installed, but JavaScript has to be enabled in the browser. Any fairly recent browser should work, but only Chrome and Firefox have been tested.
#### Data input format
The input data format is based on the format used in the [GermEval2014 Named Entity Recognition Shared Task](https://sites.google.com/site/germeval2014ner/data). Text is encoded as one token per line, with name spans encoded in the BIO-scheme, provided as tab-separated values:
* the first column (``No.``) contains one of the following:
  * a `#`, which marks a line naming the source the following sentences are cited from
  * the token position within the sentence (``>=1``)
  * ``0``, which indicates a sentence boundary
* the second column (``TOKEN``) contains the token text
* outer entity spans are encoded in the third column, ``NE-TAG``
* embedded entity spans are encoded in the fourth column, ``NE-EMB``

Example (simple):
```tsv
No. TOKEN NE-TAG NE-EMB
# https://example.url
1 Donnerstag O O
2 , O O
3 1 O O
4 . O O
5 Januar O O
6 . O O
0 O O
1 Berliner B-ORG B-LOC
2 Tageblatt I-ORG O
3 . O O
0 O O
1 Nr O O
2 . O O
3 1 O O
4 . O O
0 O O
1 Seite O O
2 3 O O
```
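As a rough illustration of how this format can be read, the following sketch groups tab-separated rows into sentences. It is not part of the tool: the function name `read_sentences` and the `SAMPLE` string are assumptions made for this example, and it assumes that the header row and `#` lines can be skipped and that a `0` in the first column closes the current sentence.

```python
import csv
import io

# Shortened sample in the format described above (columns separated by tabs).
SAMPLE = (
    "No.\tTOKEN\tNE-TAG\tNE-EMB\n"
    "# https://example.url\n"
    "1\tBerliner\tB-ORG\tB-LOC\n"
    "2\tTageblatt\tI-ORG\tO\n"
    "3\t.\tO\tO\n"
    "0\t\tO\tO\n"
    "1\tSeite\tO\tO\n"
    "2\t3\tO\tO\n"
)


def read_sentences(text):
    """Group token rows into sentences (hypothetical helper, not part of the tool)."""
    sentences, current = [], []
    # QUOTE_NONE keeps quotation marks inside tokens as literal characters.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#") or row[0] == "No.":
            continue  # skip blank lines, source lines and the header row
        if row[0] == "0":
            # a 0 in the first column marks a sentence boundary
            if current:
                sentences.append(current)
            current = []
            continue
        token = row[1] if len(row) > 1 else ""
        ne_tag = row[2] if len(row) > 2 else "O"
        ne_emb = row[3] if len(row) > 3 else "O"
        current.append((int(row[0]), token, ne_tag, ne_emb))
    if current:
        sentences.append(current)
    return sentences


sentences = read_sentences(SAMPLE)
for sentence in sentences:
    print(" ".join(token for _, token, _, _ in sentence))
```

Each sentence is returned as a list of `(position, token, NE-TAG, NE-EMB)` tuples, which keeps the outer and embedded tags side by side for later processing.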
For our purposes, we extend this format with additional columns:
* a fifth column holding an ``ID`` for the outer ``NE-TAG`` from an authority file (in this case, the [GND](https://www.dnb.de/EN/Professionell/Standardisierung/GND/gnd_node.html) is used)
* a sixth column holding the variable ``url_id`` (see [Image Support](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#image-support) for further details)
* columns 7+ holding the ``left,right,top,bottom`` pixel coordinates of facsimile snippets

Example (full):
```tsv
No. TOKEN NE-TAG NE-EMB GND-ID url_id left,right,top,bottom
# https://example.url/iiif/left,right,top,bottom/full/0/default.jpg
1 Donnerstag O O - 0 174,352,358,390
2 , O O - 0 174,352,358,390
3 1 O O - 0 367,392,361,381
4 . O O - 0 370,397,352,379
5 Januar O O - 0 406,518,358,386
6 . O O - 0 406,518,358,386
0
1 Berliner B-ORG B-LOC 1086206452 0 816,984,358,388
2 Tageblatt I-ORG O 1086206452 0 1005,1208,360,387
3 . O O - 0 1005,1208,360,387
0
1 Nr O O - 0 1237,1288,360,382
2 . O O - 0 1237,1288,360,382
3 1 O O - 0 1304,1326,361,381
4 . O O - 0 1304,1326,361,381
0
1 Seite O O - 0 1837,1926,361,392
2 3 O O - 0 1939,1967,364,385
```
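A single row of the extended format could be parsed along the following lines. This is only a sketch under stated assumptions: the `Token` class and `parse_row` function are hypothetical names, a `-` in the fifth column is taken to mean "no authority ID", and the coordinates are accepted either as one comma-separated field or as four separate tab-separated columns, since both readings are compatible with the description above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Token:
    """One token row of the extended format (hypothetical container)."""
    position: int
    text: str
    ne_tag: str
    ne_emb: str
    gnd_id: Optional[str]
    url_id: int
    box: Tuple[int, int, int, int]  # left, right, top, bottom in pixels


def parse_row(line):
    """Parse one tab-separated token row (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    position, text, ne_tag, ne_emb, gnd_id, url_id = fields[:6]
    # Assumption: coordinates come either as "l,r,t,b" in one field
    # or as four separate columns; flatten both cases into one list.
    coords = [c for field in fields[6:] for c in field.split(",")]
    left, right, top, bottom = (int(c) for c in coords[:4])
    return Token(
        position=int(position),
        text=text,
        ne_tag=ne_tag,
        ne_emb=ne_emb,
        gnd_id=None if gnd_id == "-" else gnd_id,  # assumption: "-" means no ID
        url_id=int(url_id),
        box=(left, right, top, bottom),
    )


token = parse_row("1\tBerliner\tB-ORG\tB-LOC\t1086206452\t0\t816,984,358,388")
left, right, top, bottom = token.box
print(token.text, token.gnd_id, right - left, bottom - top)
```

The width and height of a snippet follow directly from the box as `right - left` and `bottom - top`.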
#### Data preparation
We also provide some [Python tools](https://github.com/qurator-spk/neath/tree/master/tools) that help with data wrangling.
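Independently of those tools, a quick sanity check on the tag columns can catch malformed BIO sequences before a file is loaded. The sketch below assumes the strict variant of the scheme in which every entity starts with a `B-` tag; `check_bio` is a hypothetical helper name and not a function from the linked tools.

```python
def check_bio(tags):
    """Return the indices of I- tags that do not continue an entity of the same type."""
    errors = []
    previous = "O"
    for index, tag in enumerate(tags):
        if tag.startswith("I-"):
            # an I- tag must directly follow a B- or I- tag of the same entity type
            if previous == "O" or previous[2:] != tag[2:]:
                errors.append(index)
        previous = tag
    return errors


valid = check_bio(["B-ORG", "I-ORG", "O"])
invalid = check_bio(["O", "I-ORG", "I-LOC"])
print(valid, invalid)
```

The check should be run per sentence and separately for the ``NE-TAG`` and ``NE-EMB`` columns, so that an entity is never considered to continue across a sentence boundary.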
#### Navigation
* use mouse wheel to scroll up and down
* use navigation `<<` and `>>` to move faster
#### Image Support
#### Tagging
* adding a tag
* removing a tag
* changing a tag
#### OCR correction
* editing the token text
#### Segmentation correction
* merging two tokens
* splitting a token
* sentence boundaries
#### Data export/Saving progress
[neath](https://github.com/qurator-spk/neath) runs entirely locally in the browser and therefore cannot save your changes to disk automatically. Use the `Save Changes` button from time to time to save them manually.

If your browser automatically saves all downloads to your `Downloads` folder, you may want to configure it to ask where to save each file instead.

Configuration option in Firefox:

![Screenshot](./../assets/firefox.png)

Configuration option in Chrome:

![Screenshot](./../assets/chrome.png)
### 3. FAQ