You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
|
1 week ago | |
---|---|---|
qurator | 1 week ago | |
tests | 3 years ago | |
.gitignore | 2 years ago | |
.gitmodules | 3 years ago | |
CHANGELOG.md | 3 years ago | |
LICENSE | 5 years ago | |
Makefile | 3 years ago | |
README.md | 1 week ago | |
__init__.py | 4 years ago | |
example.xml | 5 years ago | |
ocrd-tool.json | 3 years ago | |
requirements-test.txt | 3 years ago | |
requirements.txt | 1 week ago | |
setup.py | 6 months ago |
README.md
TSV - Processing Tools
Create .tsv files that can be viewed and edited with neat.
Installation:
Required python version is 3.11. Consider use of pyenv if that python version is not available on your system.
Activate virtual environment (virtualenv):
source venv/bin/activate
or (pyenv):
pyenv activate my-python-3.11-virtualenv
Update pip:
pip install -U pip
Install tsvtools:
pip install git+https://github.com/qurator-spk/page2tsv.git
PAGE-XML to TSV Transformation:
Create a TSV file from OCR in PAGE-XML format (with word segmentation):
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
In order to create a TSV file for multiple PAGE XML files just perform successive calls of the tool using the same TSV file:
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
page2tsv PAGE2.xml PAGE.tsv --image-url=http://link-to-corresponding-image-2
page2tsv PAGE3.xml PAGE.tsv --image-url=http://link-to-corresponding-image-3
page2tsv PAGE4.xml PAGE.tsv --image-url=http://link-to-corresponding-image-4
page2tsv PAGE5.xml PAGE.tsv --image-url=http://link-to-corresponding-image-5
...
...
...
For instance, for the file example.xml:
page2tsv example.xml example.tsv --image-url=http://content.staatsbibliothek-berlin.de/zefys/SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg
Processing of already existing TSV files:
Create a URL-annotated TSV file from an existing TSV file:
annotate-tsv enp_DE.tsv enp_DE-annotated.tsv
Command-line interface:
page2tsv --help
Usage: page2tsv [OPTIONS] PAGE_XML_FILE TSV_OUT_FILE
Converts a page-XML file into a TSV file that can be edited with neat.
Optionally the tool also accepts NER and Entitiy Linking API-Endpoints as
parameters and performs NER and EL and the document if these are provided.
PAGE_XML_FILE: The source page-XML file. TSV_OUT_FILE: Resulting TSV file.
Options:
--purpose [NERD|OCR] Purpose of output tsv file.
NERD: NER/NED application/ground-truth creation.
OCR: OCR application/ground-truth creation.
default: NERD.
--image-url TEXT An image retrieval link that enables neat to show
the scan images corresponding to the text tokens.
Example: https://content.staatsbibliothek-berlin.
de/zefys/SNP26824620-18371109-0-1-0-0/left,top,wi
dth,height/full/0/default.jpg
--ner-rest-endpoint TEXT REST endpoint of sbb_ner service. See
https://github.com/qurator-spk/sbb_ner for
details. Only applicable in case of NERD.
--ned-rest-endpoint TEXT REST endpoint of sbb_ned service. See
https://github.com/qurator-spk/sbb_ned for
details. Only applicable in case of NERD.
--noproxy disable proxy. default: enabled.
--scale-factor FLOAT default: 1.0
--ned-threshold FLOAT
--min-confidence FLOAT
--max-confidence FLOAT
--ned-priority INTEGER
--normalization-file PATH
--help Show this message and exit.
tsv2tsv --help
Usage: tsv2tsv [OPTIONS] TSV_IN_FILE
Options:
--tsv-out-file PATH Write modified TSV to this file.
--ner-rest-endpoint TEXT REST endpoint of sbb_ner service. See
https://github.com/qurator-spk/sbb_ner for
details.
--noproxy disable proxy. default: enabled.
--num-tokens Print number of tokens in input/output file.
--sentence-count Print sentence count in input/output file.
--max-sentence-len Print maximum sentence len for input/output
file.
--keep-tokenization Keep the word tokenization exactly as it is.
--sentence-split-only Do only sentence splitting.
--show-urls Print contained visualization URLs.
--just-zero Process only files that have max sentence
length zero,i.e., that do not have sentence
splitting.
--sanitize-sentence-numbers Sanitize sentence numbering.
--show-columns Show TSV columns.
--drop-column TEXT Drop column
--help Show this message and exit.
alto2tsv --help
Usage: alto2tsv [OPTIONS] ALTO_XML_FILE TSV_OUT_FILE
Converts a ALTO-XML file into a TSV file that can be edited with neat.
Optionally the tool also accepts NER and Entitiy Linking API-Endpoints as
parameters and performs NER and EL and the document if these are provided.
ALTO_XML_FILE: The source ALTO-XML file.
TSV_OUT_FILE: Resulting TSV file.
Options:
--purpose [NERD|OCR] Purpose of output tsv file.
NERD: NER/NED application/ground-truth creation.
OCR: OCR application/ground-truth creation.
default: NERD.
--image-url TEXT An image retrieval link that enables neat to show
the scan images corresponding to the text tokens.
Example: https://content.staatsbibliothek-berlin.d
e/zefys/SNP26824620-18371109-0-1-0-0/left,top,widt
h,height/full/0/default.jpg
--ner-rest-endpoint TEXT REST endpoint of sbb_ner service. See
https://github.com/qurator-spk/sbb_ner for
details. Only applicable in case of NERD.
--ned-rest-endpoint TEXT REST endpoint of sbb_ned service. See
https://github.com/qurator-spk/sbb_ned for
details. Only applicable in case of NERD.
--noproxy disable proxy. default: enabled.
--scale-factor FLOAT default: 1.0
--ned-threshold FLOAT
--ned-priority INTEGER
--help Show this message and exit.