TSV - Processing Tools
Installation:
Set up a virtual environment:
virtualenv --python=python3.6 venv
Activate virtual environment:
source venv/bin/activate
Upgrade pip:
pip install -U pip
Install the package together with its dependencies in development mode:
pip install -e ./
PAGE-XML to TSV Transformation:
Create a TSV file from OCR in PAGE-XML format (with word segmentation):
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
To create a single TSV file from multiple PAGE-XML files, call the tool repeatedly with the same TSV file:
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
page2tsv PAGE2.xml PAGE.tsv --image-url=http://link-to-corresponding-image-2
page2tsv PAGE3.xml PAGE.tsv --image-url=http://link-to-corresponding-image-3
page2tsv PAGE4.xml PAGE.tsv --image-url=http://link-to-corresponding-image-4
page2tsv PAGE5.xml PAGE.tsv --image-url=http://link-to-corresponding-image-5
...
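For a larger number of pages, the successive calls above can be generated rather than typed by hand. The following is a minimal sketch, not part of this package: the helper names `build_page2tsv_commands` and `run_commands` are hypothetical, and the file names and URLs are the placeholders used above. It only relies on the `page2tsv` invocation shown in this README.

```python
import shlex
import subprocess
from typing import Iterable, List, Tuple


def build_page2tsv_commands(
    pages: Iterable[Tuple[str, str]], tsv_file: str
) -> List[List[str]]:
    """Return one page2tsv argument list per (PAGE-XML file, image URL) pair.

    Hypothetical helper: it mirrors the command line shown in this README
    and always targets the same TSV file.
    """
    return [
        ["page2tsv", page_xml, tsv_file, f"--image-url={image_url}"]
        for page_xml, image_url in pages
    ]


def run_commands(commands: Iterable[List[str]]) -> None:
    """Run the commands in order and stop at the first one that fails."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder inputs taken from the examples above.
    pages = [
        (f"PAGE{i}.xml", f"http://link-to-corresponding-image-{i}")
        for i in range(1, 6)
    ]
    # Dry run: print the commands instead of executing them.
    for command in build_page2tsv_commands(pages, "PAGE.tsv"):
        print(" ".join(shlex.quote(part) for part in command))
```

Run as a script, this only prints the five commands. Passing the result of `build_page2tsv_commands` to `run_commands` executes them, which requires `page2tsv` to be installed in the active virtual environment.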
A corresponding URL-mapping file can be obtained from:
extract-doc-links PAGE.tsv PAGE-urls.tsv
By loading the annotated TSV together with the URL-mapping file into ner.edith, you can jump directly to the original image from which the full text was extracted.
Processing of already existing TSV files:
Create a URL-annotated TSV file from an existing TSV file:
annotate-tsv enp_DE.tsv enp_DE-annotated.tsv
Create a corresponding URL-mapping file:
extract-doc-links enp_DE.tsv enp_DE-urls.tsv
By loading the annotated TSV together with the URL-mapping file into ner.edith, you can jump directly to the original image from which the full text was extracted.
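Before loading the files into ner.edith, it can be useful to check what a generated TSV file contains. The sketch below uses only the Python standard library. It assumes that the first line of the file is a header row, which this README does not state, and the helper name `read_tsv` and the sample column names are hypothetical.

```python
import csv
import io
from typing import Dict, List, TextIO


def read_tsv(handle: TextIO) -> List[Dict[str, str]]:
    """Parse tab-separated text into one dict per row.

    Assumption: the first line is a header row naming the columns.
    Quoting is disabled so that quote characters in the text are kept as-is.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


if __name__ == "__main__":
    # Synthetic sample with made-up column names, used instead of a real file.
    sample = io.StringIO("first\tsecond\nx\ty\n")
    for row in read_tsv(sample):
        print(row)
```

To inspect a real file, open it with `open("enp_DE-annotated.tsv", encoding="utf-8", newline="")` and pass the handle to `read_tsv`.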