|
|
|
# TSV - Processing Tools
|
|
|
|
|
|
|
|
Create .tsv files that can be viewed and edited with [neat](https://github.com/qurator-spk/neat).
|
|
|
|
|
|
|
|
## Installation:
|
|
|
|
|
|
|
|
Required python version is 3.11.
|
|
|
|
Consider use of [pyenv](https://github.com/pyenv/pyenv) if that python version is not available on your system.
|
|
|
|
|
|
|
|
Activate virtual environment (virtualenv):
|
|
|
|
```
|
|
|
|
source venv/bin/activate
|
|
|
|
```
|
|
|
|
or (pyenv):
|
|
|
|
```
|
|
|
|
pyenv activate my-python-3.11-virtualenv
|
|
|
|
```
|
|
|
|
|
|
|
|
Update pip:
|
|
|
|
```
|
|
|
|
pip install -U pip
|
|
|
|
```
|
|
|
|
Install sbb_images:
|
|
|
|
```
|
|
|
|
pip install git+https://github.com/qurator-spk/page2tsv.git
|
|
|
|
```
|
|
|
|
|
|
|
|
## PAGE-XML to TSV Transformation:
|
|
|
|
|
|
|
|
Create a TSV file from OCR in PAGE-XML format (with word segmentation):
|
|
|
|
|
|
|
|
```
|
|
|
|
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
|
|
|
|
```
|
|
|
|
|
|
|
|
In order to create a TSV file for multiple PAGE XML files just perform successive calls
|
|
|
|
of the tool using the same TSV file:
|
|
|
|
|
|
|
|
```
|
|
|
|
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
|
|
|
|
page2tsv PAGE2.xml PAGE.tsv --image-url=http://link-to-corresponding-image-2
|
|
|
|
page2tsv PAGE3.xml PAGE.tsv --image-url=http://link-to-corresponding-image-3
|
|
|
|
page2tsv PAGE4.xml PAGE.tsv --image-url=http://link-to-corresponding-image-4
|
|
|
|
page2tsv PAGE5.xml PAGE.tsv --image-url=http://link-to-corresponding-image-5
|
|
|
|
...
|
|
|
|
...
|
|
|
|
...
|
|
|
|
```
|
|
|
|
|
|
|
|
For instance, for the file [example.xml](https://github.com/qurator-spk/page2tsv/blob/master/example.xml):
|
|
|
|
|
|
|
|
```
|
|
|
|
page2tsv example.xml example.tsv --image-url=http://content.staatsbibliothek-berlin.de/zefys/SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg
|
|
|
|
```
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
## Processing of already existing TSV files:
|
|
|
|
|
|
|
|
Create a URL-annotated TSV file from an existing TSV file:
|
|
|
|
|
|
|
|
```
|
|
|
|
annotate-tsv enp_DE.tsv enp_DE-annotated.tsv
|
|
|
|
```
|
|
|
|
|
|
|
|
# Command-line interface:
|
|
|
|
|
|
|
|
```
|
|
|
|
page2tsv [OPTIONS] PAGE_XML_FILE TSV_OUT_FILE
|
|
|
|
|
|
|
|
Options:
|
|
|
|
--purpose [NERD|OCR] Purpose of output tsv file.
|
|
|
|
|
|
|
|
NERD: NER/NED application/ground-truth creation.
|
|
|
|
|
|
|
|
OCR: OCR application/ground-truth creation.
|
|
|
|
|
|
|
|
default: NERD.
|
|
|
|
--image-url TEXT
|
|
|
|
--ner-rest-endpoint TEXT REST endpoint of sbb_ner service. See
|
|
|
|
https://github.com/qurator-spk/sbb_ner for
|
|
|
|
details. Only applicable in case of NERD.
|
|
|
|
--ned-rest-endpoint TEXT REST endpoint of sbb_ned service. See
|
|
|
|
https://github.com/qurator-spk/sbb_ned for
|
|
|
|
details. Only applicable in case of NERD.
|
|
|
|
--noproxy disable proxy. default: enabled.
|
|
|
|
--scale-factor FLOAT default: 1.0
|
|
|
|
--ned-threshold FLOAT
|
|
|
|
--min-confidence FLOAT
|
|
|
|
--max-confidence FLOAT
|
|
|
|
--ned-priority INTEGER
|
|
|
|
--help Show this message and exit.
|
|
|
|
|
|
|
|
```
|