mirror of https://github.com/qurator-spk/page2tsv.git synced 2025-06-15 22:39:54 +02:00

TSV - Processing Tools

Create .tsv files that can be viewed and edited with neat.

Installation:

The required Python version is 3.11. Consider using pyenv if that Python version is not available on your system.

Activate virtual environment (virtualenv):

source venv/bin/activate

or (pyenv):

pyenv activate my-python-3.11-virtualenv

Update pip:

pip install -U pip

Install page2tsv:

pip install git+https://github.com/qurator-spk/page2tsv.git

PAGE-XML to TSV Transformation:

Create a TSV file from OCR in PAGE-XML format (with word segmentation):

page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1

To create a single TSV file from multiple PAGE-XML files, call the tool once per PAGE-XML file and pass the same TSV file each time:

page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
page2tsv PAGE2.xml PAGE.tsv --image-url=http://link-to-corresponding-image-2
page2tsv PAGE3.xml PAGE.tsv --image-url=http://link-to-corresponding-image-3
page2tsv PAGE4.xml PAGE.tsv --image-url=http://link-to-corresponding-image-4
page2tsv PAGE5.xml PAGE.tsv --image-url=http://link-to-corresponding-image-5
...
...
...
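
The successive calls above can also be scripted. The following is a minimal sketch, not part of page2tsv: the helper names build_commands and run_all are hypothetical, and it assumes that page2tsv is on the PATH and that each PAGE-XML file is paired with its image URL.

```python
import subprocess


def build_commands(pages, tsv_file):
    """Return one page2tsv argument list per (page_xml, image_url) pair.

    Every command targets the same TSV file, mirroring the successive
    calls shown in the README.
    """
    return [
        ["page2tsv", page_xml, tsv_file, f"--image-url={image_url}"]
        for page_xml, image_url in pages
    ]


def run_all(pages, tsv_file):
    """Run the commands in order and stop at the first failure.

    Assumes the page2tsv executable is installed and on the PATH.
    """
    for command in build_commands(pages, tsv_file):
        subprocess.run(command, check=True)
```

For example, run_all([("PAGE1.xml", "http://link-to-corresponding-image-1"), ("PAGE2.xml", "http://link-to-corresponding-image-2")], "PAGE.tsv") would issue the first two calls listed above, in order.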

For instance, for the file example.xml:

page2tsv example.xml example.tsv --image-url=http://content.staatsbibliothek-berlin.de/zefys/SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg
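
The image URL in this example follows the IIIF Image API layout, in which the path segment after the identifier selects a region as x,y,w,h. The literal text left,top,width,height appears to act as a placeholder for the coordinates of a region. The sketch below shows how such a template could be filled in for one region; the function region_url is hypothetical, and whether page2tsv performs the substitution in exactly this way is an assumption.

```python
PLACEHOLDER = "left,top,width,height"


def region_url(template, left, top, width, height):
    """Replace the region placeholder in an image URL template.

    The coordinates are pixel values of a bounding box in the full image.
    """
    if PLACEHOLDER not in template:
        raise ValueError("template does not contain the region placeholder")
    return template.replace(PLACEHOLDER, f"{left},{top},{width},{height}")


template = (
    "http://content.staatsbibliothek-berlin.de/zefys/"
    "SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg"
)
# Illustrative coordinates only; real values come from the PAGE-XML file.
url = region_url(template, 10, 20, 300, 40)
```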

Processing of already existing TSV files:

Create a URL-annotated TSV file from an existing TSV file:

annotate-tsv enp_DE.tsv enp_DE-annotated.tsv

Command-line interface:

page2tsv [OPTIONS] PAGE_XML_FILE TSV_OUT_FILE

Options:
  --purpose [NERD|OCR]      Purpose of output tsv file.
                            
                            NERD: NER/NED application/ground-truth creation.
                            
                            OCR: OCR application/ground-truth creation.
                            
                            default: NERD.
  --image-url TEXT
  --ner-rest-endpoint TEXT  REST endpoint of sbb_ner service. See
                            https://github.com/qurator-spk/sbb_ner for
                            details. Only applicable in case of NERD.
  --ned-rest-endpoint TEXT  REST endpoint of sbb_ned service. See
                            https://github.com/qurator-spk/sbb_ned for
                            details. Only applicable in case of NERD.
  --noproxy                 disable proxy. default: enabled.
  --scale-factor FLOAT      default: 1.0
  --ned-threshold FLOAT
  --min-confidence FLOAT
  --max-confidence FLOAT
  --ned-priority INTEGER
  --help                    Show this message and exit.
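
When page2tsv is called from a script, the options above can be assembled programmatically. The following is a minimal sketch under stated assumptions: the helper page2tsv_args is hypothetical, it covers only a subset of the options, and rejecting the REST endpoints for the OCR purpose is a choice made in this sketch based on the help text ("Only applicable in case of NERD"), not confirmed behaviour of the tool.

```python
def page2tsv_args(page_xml, tsv_out, purpose="NERD", image_url=None,
                  ner_rest_endpoint=None, ned_rest_endpoint=None):
    """Build an argument list for page2tsv from the documented options."""
    if purpose not in ("NERD", "OCR"):
        raise ValueError("purpose must be NERD or OCR")
    # The help text marks both endpoints as applicable to NERD only.
    if purpose != "NERD" and (ner_rest_endpoint or ned_rest_endpoint):
        raise ValueError("REST endpoints are only applicable to purpose NERD")

    args = ["page2tsv", "--purpose", purpose]
    if image_url:
        args += ["--image-url", image_url]
    if ner_rest_endpoint:
        args += ["--ner-rest-endpoint", ner_rest_endpoint]
    if ned_rest_endpoint:
        args += ["--ned-rest-endpoint", ned_rest_endpoint]
    # Positional arguments follow the options: PAGE_XML_FILE TSV_OUT_FILE.
    return args + [page_xml, tsv_out]
```

The resulting list can be passed to subprocess.run; the default purpose matches the documented default, NERD.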