mirror of
https://github.com/qurator-spk/page2tsv.git
synced 2025-10-27 22:54:13 +01:00
No description
| tests | ||
| tsvtools | ||
| .gitignore | ||
| .gitmodules | ||
| __init__.py | ||
| example.xml | ||
| LICENSE | ||
| Makefile | ||
| ocrd-tool.json | ||
| README.md | ||
| requirements-test.txt | ||
| requirements.txt | ||
| setup.py | ||
TSV - Processing Tools
Installation:
Setup virtual environment:
virtualenv --python=python3.6 venv
Activate virtual environment:
source venv/bin/activate
Upgrade pip:
pip install -U pip
Install package together with its dependencies in development mode:
pip install -e ./
PAGE-XML to TSV Transformation:
Create a TSV file from OCR in PAGE-XML format (with word segmentation):
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
In order to create a TSV file for multiple PAGE XML files just perform successive calls of the tool using the same TSV file:
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
page2tsv PAGE2.xml PAGE.tsv --image-url=http://link-to-corresponding-image-2
page2tsv PAGE3.xml PAGE.tsv --image-url=http://link-to-corresponding-image-3
page2tsv PAGE4.xml PAGE.tsv --image-url=http://link-to-corresponding-image-4
page2tsv PAGE5.xml PAGE.tsv --image-url=http://link-to-corresponding-image-5
...
...
...
For instance, for the file example.xml:
page2tsv example.xml example.tsv --image-url=http://content.staatsbibliothek-berlin.de/zefys/SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg
Processing of already existing TSV files:
Create a URL-annotated TSV file from an existing TSV file:
annotate-tsv enp_DE.tsv enp_DE-annotated.tsv