final name

pull/52/head
cneud 5 years ago
parent 6adee57607
commit dbafbc3261

@@ -1,46 +1,46 @@
# neath: named entity annotation tool
# neat: named entity annotation tool
#### version 0.1
---
![Screenshot](https://user-images.githubusercontent.com/952378/72167036-ad2c6680-33ca-11ea-980f-5859e5155877.png)
---
### Table of contents
[1. Introduction](https://github.com/qurator-spk/neath/blob/master/README.md#1-introduction)
[1. Introduction](https://github.com/qurator-spk/neat/blob/master/README.md#1-introduction)
[2. User Guide](https://github.com/qurator-spk/neath/blob/master/README.md#2-user-guide)
[2. User Guide](https://github.com/qurator-spk/neat/blob/master/README.md#2-user-guide)
   [2.1 Technical requirements](https://github.com/qurator-spk/neath/blob/master/README.md#21-technical-requirements)
   [2.1 Technical requirements](https://github.com/qurator-spk/neat/blob/master/README.md#21-technical-requirements)
   [2.2 Installation](https://github.com/qurator-spk/neath/blob/master/README.md#22-installation)
   [2.2 Installation](https://github.com/qurator-spk/neat/blob/master/README.md#22-installation)
   [2.3 Data format](https://github.com/qurator-spk/neath/blob/master/README.md#23-data-format)
   [2.3 Data format](https://github.com/qurator-spk/neat/blob/master/README.md#23-data-format)
   [2.4 Data preparation](https://github.com/qurator-spk/neath/blob/master/README.md#24-data-preparation)
   [2.4 Data preparation](https://github.com/qurator-spk/neat/blob/master/README.md#24-data-preparation)
   [2.5 Provenance](https://github.com/qurator-spk/neath/blob/master/README.md#25-provenance)
   [2.5 Provenance](https://github.com/qurator-spk/neat/blob/master/README.md#25-provenance)
   [2.6 Keyboard navigation](https://github.com/qurator-spk/neath/blob/master/README.md#26-keyboard-navigation)
   [2.6 Keyboard navigation](https://github.com/qurator-spk/neat/blob/master/README.md#26-keyboard-navigation)
   [2.7 Mouse navigation](https://github.com/qurator-spk/neath/blob/master/README.md#27-mouse-navigation)
   [2.7 Mouse navigation](https://github.com/qurator-spk/neat/blob/master/README.md#27-mouse-navigation)
   [2.8 Image support](https://github.com/qurator-spk/neath/blob/master/README.md#28-image-support)
   [2.8 Image support](https://github.com/qurator-spk/neat/blob/master/README.md#28-image-support)
   [2.9 Saving progress](https://github.com/qurator-spk/neath/blob/master/README.md#29-saving-progress)
   [2.9 Saving progress](https://github.com/qurator-spk/neat/blob/master/README.md#29-saving-progress)
[3. Annotation Guidelines](https://github.com/qurator-spk/neath/blob/master/README.md#3-annotation-guidelines)
[3. Annotation Guidelines](https://github.com/qurator-spk/neat/blob/master/README.md#3-annotation-guidelines)
### 1. Introduction
[neath](https://github.com/qurator-spk/neath) is a simple, browser-based tool for editing and annotating text with named entities to produce a corpus for training/testing/evaluation. It can be used to add or correct named entity BIO-tags in a TSV file and to correct the token text or tokenization (e.g. due to OCR/segmentation errors).
[neat](https://github.com/qurator-spk/neat) is a simple, browser-based tool for editing and annotating text with named entities to produce a corpus for training/testing/evaluation. It can be used to add or correct named entity BIO-tags in a TSV file and to correct the token text or tokenization (e.g. due to OCR/segmentation errors).
[neath](https://github.com/qurator-spk/neath) is developed at the [Berlin State Library](https://staatsbibliothek-berlin.de/) for data annotation in the context of the [SoNAR-IDH](https://sonar.fh-potsdam.de/) project and the [QURATOR](https://qurator.ai/) project.
[neat](https://github.com/qurator-spk/neat) is developed at the [Berlin State Library](https://staatsbibliothek-berlin.de/) for data annotation in the context of the [SoNAR-IDH](https://sonar.fh-potsdam.de/) project and the [QURATOR](https://qurator.ai/) project.
### 2. User Guide
#### 2.1 Technical Requirements
[neath](https://github.com/qurator-spk/neath) runs locally as a pure HTML+JavaScript webpage in your web browser. No software needs to be installed, but JavaScript has to be enabled in the browser.
[neat](https://github.com/qurator-spk/neat) runs locally as a pure HTML+JavaScript webpage in your web browser. No software needs to be installed, but JavaScript has to be enabled in the browser.
#### 2.2. Installation
Simply clone the repo using ``git clone https://github.com/qurator-spk/neath.git`` or download the [ZIP](https://github.com/qurator-spk/neath/archive/master.zip). Make sure you have at minimum ``neath.html`` and ``neath.js`` residing in a local directory, then it is sufficient to just open ``neath.html`` in a browser. Any fairly recent browser should work, but only Chrome and Firefox are tested.
Simply clone the repo using ``git clone https://github.com/qurator-spk/neat.git`` or download the [ZIP](https://github.com/qurator-spk/neat/archive/master.zip). Make sure you have at minimum ``neat.html`` and ``neat.js`` residing in a local directory, then it is sufficient to just open ``neat.html`` in a browser. Any fairly recent browser should work, but only Chrome and Firefox are tested.
#### 2.3 Data format
The data format is based on the format used in the [GermEval2014 Named Entity Recognition Shared Task](https://sites.google.com/site/germeval2014ner/data). Text is encoded as one token per line, with name spans encoded in the BIO-scheme, provided as tab-separated values:
@@ -77,7 +77,7 @@ No. TOKEN NE-TAG NE-EMB
For our purposes we extend this format by adding
* a fifth column for an ``ID`` for the outer ``NE-TAG`` from an authority file (in this case, the [GND](https://www.dnb.de/EN/Professionell/Standardisierung/GND/gnd_node.html) is used)
* column six for use as a variable ``url_id`` (see [Image Support](https://github.com/qurator-spk/neath/blob/master/README.md#28-image-support) for further details)
* column six for use as a variable ``url_id`` (see [Image Support](https://github.com/qurator-spk/neat/blob/master/README.md#28-image-support) for further details)
* finally, columns 7+ are used for storing ``left,right,top,bottom`` pixel coordinates for facsimile snippets
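As a rough illustration of the extended column layout listed above, the following JavaScript sketch splits one data row into named fields. The function `parseRow`, the field names, and the sample row are hypothetical and not taken from the tool's source. It assumes the columns appear in the order No., TOKEN, NE-TAG, NE-EMB, ID, url_id, followed by left, right, top, bottom as four separate tab-separated columns.

```javascript
// Hypothetical helper, not part of the tool: maps one tab-separated row to named fields.
function parseRow(line) {
  const cols = line.split("\t");
  // Columns 1-6 are scalar fields; everything from column 7 on is a pixel coordinate.
  const [no, token, neTag, neEmb, id, urlId, ...coords] = cols;
  const [left, right, top, bottom] = coords.map(Number);
  return {
    no: Number(no),
    token,
    neTag,
    neEmb,
    id,
    urlId,
    box: { left, right, top, bottom },
  };
}

// Made-up sample row, for illustration only.
const row = parseRow("1\tBerlin\tB-LOC\tO\t-\t0\t100\t180\t40\t70");
console.log(row.token, row.neTag, row.box);
```

Rows that start with ``#`` carry source information rather than tokens, so a real reader would have to skip or handle them separately before calling a helper like this.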
Example (full):
@@ -105,7 +105,7 @@ No. TOKEN NE-TAG NE-EMB GND-ID url_id left,right,top,bottom
```
#### 2.4 Data preparation
The source data that is used for annotation are OCR results in [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) format. We provide a [Python tool](https://github.com/qurator-spk/page2tsv) that supports the transformation of [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) OCR files into the [TSV format](https://github.com/qurator-spk/neath/blob/master/README.md#23-data-format) required for use with [neath](https://github.com/qurator-spk/neath).
The source data that is used for annotation are OCR results in [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) format. We provide a [Python tool](https://github.com/qurator-spk/page2tsv) that supports the transformation of [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) OCR files into the [TSV format](https://github.com/qurator-spk/neat/blob/master/README.md#23-data-format) required for use with [neat](https://github.com/qurator-spk/neat).
#### 2.5 Provenance
The processing pipeline applied at the Berlin State Library comprises the following steps:
@@ -115,7 +115,7 @@ Layout Analysis & Textline Extraction @[sbb_textline_detector](https://github.co
2. OCR & Word Segmentation
OCR is based on [OCR-D](https://github.com/OCR-D)'s [ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr) which requires [Tesseract](https://github.com/tesseract-ocr/tesseract) **>= 4.1.0**. The [GT4HistOCR_2000000](https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/GT4HistOCR_2000000.traineddata) model, which is [trained](https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR) on the [GT4HistOCR](https://zenodo.org/record/1344132) corpus, is used. Further details are available in the [paper](https://arxiv.org/abs/1809.05501).
3. TSV Transformation
A simple [Python tool](https://github.com/qurator-spk/page2tsv) is used for the transformation of the OCR results in [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) to [TSV](https://github.com/qurator-spk/neath/blob/master/docs/README.md#23-data-format).
A simple [Python tool](https://github.com/qurator-spk/page2tsv) is used for the transformation of the OCR results in [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) to [TSV](https://github.com/qurator-spk/neat/blob/master/docs/README.md#23-data-format).
4. Tokenization
For tokenization, [SoMaJo](https://github.com/tsproisl/SoMaJo) is used.
5. Named Entity Recognition
@@ -180,13 +180,13 @@ For Named Entity Recognition, a [BERT-Base](https://github.com/google-research/b
* left-click the `POSITION` of a row and select `start-sentence` from the drop-down menu to start a new sentence
#### 2.8 Image Support
Provided facsimile images are available online via the [iiif.io](https://iiif.io/) Image API, [neath](https://github.com/qurator-spk/neath) supports the embedding of facsimile snippets into its interface to help with data annotation and correction.
Provided facsimile images are available online via the [iiif.io](https://iiif.io/) Image API, [neat](https://github.com/qurator-spk/neat) supports the embedding of facsimile snippets into its interface to help with data annotation and correction.
This further requires that OCR with word segmentation is applied to the image to determine bounding boxes for tokens.
The iiif-image-url contained in the source ``#`` can then be used as a replacement for ``url_id`` in combination with the token bounding boxes as ``left,right,top,bottom`` to obtain the facsimile snippet url and display the image in the leftmost column. Clicking on the facsimile snippet opens up a new tab with a larger context.
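The step from a token bounding box to a snippet URL can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: `snippetUrl` and the example base URL are hypothetical, and the sketch assumes that the base URL already identifies the page image on an IIIF Image API server. It relies only on the Image API's documented URL pattern `{region}/{size}/{rotation}/{quality}.{format}`, where a pixel region is written as `x,y,width,height`.

```javascript
// Hypothetical helper: builds an IIIF Image API URL for a token's bounding box.
// Assumes `baseUrl` is the image's IIIF base URI (scheme, server, prefix and identifier).
function snippetUrl(baseUrl, { left, right, top, bottom }) {
  // IIIF expects x,y,width,height, so right/bottom are converted to width/height.
  const region = [left, top, right - left, bottom - top].join(",");
  // Size "max", rotation 0, default quality, JPEG output.
  return `${baseUrl}/${region}/max/0/default.jpg`;
}

// Made-up base URL and coordinates, for illustration only.
console.log(
  snippetUrl("https://example.org/iiif/page-0001", { left: 100, right: 180, top: 40, bottom: 70 })
);
```

The `max` size keyword was introduced in Image API 2.1; servers that only implement earlier versions expect `full` instead. A larger context view could be requested the same way by widening the region before building the URL.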
#### 2.9 Saving progress
[neath](https://github.com/qurator-spk/neath) runs fully locally in the browser. Therefore it cannot automatically save any changes you made to disk. You have to use the `Save Changes` button in order to do so manually from time to time. If your browser automatically saves all downloads to your `Downloads` folder, you might want to configure it so that it instead prompts you where to save.
[neat](https://github.com/qurator-spk/neat) runs fully locally in the browser. Therefore it cannot automatically save any changes you made to disk. You have to use the `Save Changes` button in order to do so manually from time to time. If your browser automatically saves all downloads to your `Downloads` folder, you might want to configure it so that it instead prompts you where to save.
### 3. Annotation Guidelines
The most recent version of the [Annotation Guidelines](https://github.com/qurator-spk/neath/blob/master/Annotation_Guidelines.pdf) is included in this repository.
The most recent version of the [Annotation Guidelines](https://github.com/qurator-spk/neat/blob/master/Annotation_Guidelines.pdf) is included in this repository.

@@ -2,8 +2,8 @@
<html>
<head>
<meta charset="UTF-8">
<title>neath</title>
<base href="neath.html" target="_blank">
<title>neat</title>
<base href="neat.html" target="_blank">
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css"
integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/PapaParse/5.1.0/papaparse.min.js"></script>
@@ -55,8 +55,8 @@
<div class="col-9">
<div class="row">
<div class="col text-center">
<h3><a href="https://github.com/qurator-spk/neath" target="_blank" tabindex="-1">neath</a>: named entity annotation tool</h3>
<a href="https://github.com/qurator-spk/neath/blob/master/README.md#2-user-guide" target="_blank" tabindex="-1">User Guide</a> | <a href="https://github.com/qurator-spk/neath/blob/master/Annotation_Guidelines.pdf" target="_blank" tabindex="-1">Annotation Guidelines</a> | <a href="https://github.com/qurator-spk/neath/issues" target="_blank" tabindex="-1">Issues</a><hr>
<h3><a href="https://github.com/qurator-spk/neat" target="_blank" tabindex="-1">neat</a>: named entity annotation tool</h3>
<a href="https://github.com/qurator-spk/neat/blob/master/README.md#2-user-guide" target="_blank" tabindex="-1">User Guide</a> | <a href="https://github.com/qurator-spk/neat/blob/master/Annotation_Guidelines.pdf" target="_blank" tabindex="-1">Annotation Guidelines</a> | <a href="https://github.com/qurator-spk/neat/issues" target="_blank" tabindex="-1">Issues</a><hr>
</div>
</div>
</div>
@@ -70,7 +70,7 @@
</a>
</div>
<div class="col-8 text-center" id="tableregion">
Please upload a TSV<sup>(<a href="https://github.com/qurator-spk/neath/blob/master/User_Guide.md#22-data-format">i</a>)</sup> file:
Please upload a TSV<sup>(<a href="https://github.com/qurator-spk/neat/blob/master/User_Guide.md#22-data-format">i</a>)</sup> file:
<br><br>
<input type="file" id="tsv-file" name="files"/>
</div>
@@ -88,6 +88,6 @@
</div>
<script src="neath.js"></script>
<script src="neat.js"></script>
</body>
</html>

@@ -1009,14 +1009,14 @@ $(document).ready(
loadFile ( evt,
function(results, file, urls) {
let neath = setupInterface(results, file, urls);
let neat = setupInterface(results, file, urls);
$(window).bind("beforeunload",
function() {
console.log(neath.hasChanges());
console.log(neat.hasChanges());
if (neath.hasChanges())
if (neat.hasChanges())
return confirm("You have unsaved changes. Do you want to save them before leaving?");
}
);
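For reference, the unsaved-changes check in this hunk can also be written without jQuery. The sketch below is an illustration rather than a proposed change to this commit: `makeUnloadGuard` is a hypothetical name, and it assumes only an object exposing a `hasChanges()` method, as the hunk does. Browsers generally ignore `confirm()` calls and custom messages during `beforeunload` and show their own generic dialog, so the handler merely signals that a prompt is wanted.

```javascript
// Hypothetical helper, not taken from neat.js: returns a beforeunload handler
// that asks the browser to show its built-in confirmation dialog
// whenever hasChanges() reports unsaved edits.
function makeUnloadGuard(hasChanges) {
  return function (event) {
    if (!hasChanges()) {
      return undefined; // nothing to protect, let the page unload
    }
    event.preventDefault();  // standard way to request the prompt
    event.returnValue = "";  // legacy way, still honoured by some browsers
    return "";
  };
}

// Browser usage (not executed here):
// window.addEventListener("beforeunload", makeUnloadGuard(() => neat.hasChanges()));
```

Keeping the handler in a small factory makes it possible to exercise the logic with a stub event object outside the browser.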