# TSV - Processing Tools
## Installation:
Set up a virtual environment:
```
virtualenv --python=python3.6 venv
```
Activate virtual environment:
```
source venv/bin/activate
```
Upgrade pip:
```
pip install -U pip
```
Install the package together with its dependencies in development mode:
```
pip install -e ./
```
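
As an optional sanity check after installation, the following sketch reports which command-line tools are not yet on the `PATH`. The helper name `missing_tools` is hypothetical and not part of this package; it relies only on the shell's standard `command -v`. It could be called as `missing_tools page2tsv annotate-tsv extract-doc-links`, using the tool names that appear in this README.

```shell
# Hypothetical helper (not shipped with this package): print the name of
# every given command that cannot be found on PATH and return non-zero
# if at least one of them is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}
```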
## PAGE-XML to TSV Transformation:
Create a TSV file from OCR in PAGE-XML format (with word segmentation):
```
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
```
To create a single TSV file from multiple PAGE-XML files, call the tool
repeatedly with the same TSV file:
```
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
page2tsv PAGE2.xml PAGE.tsv --image-url=http://link-to-corresponding-image-2
page2tsv PAGE3.xml PAGE.tsv --image-url=http://link-to-corresponding-image-3
page2tsv PAGE4.xml PAGE.tsv --image-url=http://link-to-corresponding-image-4
page2tsv PAGE5.xml PAGE.tsv --image-url=http://link-to-corresponding-image-5
...
...
...
```
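
The successive calls above can also be scripted. The following is a minimal sketch, not part of the package: the function name `page2tsv_batch` and the `PAGE2TSV` override variable are assumptions made for illustration. It takes the target TSV file followed by pairs of PAGE-XML file and image URL, and invokes `page2tsv` once per pair exactly as shown above. Setting `PAGE2TSV=echo` prints the arguments of each call instead of running the tool.

```shell
# Hypothetical wrapper (not shipped with this package).
# Usage: page2tsv_batch OUT.tsv PAGE1.xml URL1 [PAGE2.xml URL2 ...]
# Dry run: PAGE2TSV=echo page2tsv_batch OUT.tsv PAGE1.xml URL1
page2tsv_batch() {
    out_tsv=$1
    shift
    # Consume the remaining arguments two at a time: PAGE-XML file, image URL.
    while [ "$#" -ge 2 ]; do
        "${PAGE2TSV:-page2tsv}" "$1" "$out_tsv" --image-url="$2" || return 1
        shift 2
    done
    # A leftover argument means a PAGE-XML file was given without its URL.
    if [ "$#" -ne 0 ]; then
        echo "page2tsv_batch: PAGE-XML file without image URL: $1" >&2
        return 2
    fi
}
```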
A corresponding URL-mapping file can be created with:
```
extract-doc-links PAGE.tsv PAGE-urls.tsv
```
By loading the annotated TSV and the URL-mapping file into ner.edith,
you can jump directly to the original image from which the full text
was extracted.

---
## Processing of existing TSV files:
Create a URL-annotated TSV file from an existing TSV file:
```
annotate-tsv enp_DE.tsv enp_DE-annotated.tsv
```
Create a corresponding URL-mapping file:
```
extract-doc-links enp_DE.tsv enp_DE-urls.tsv
```
By loading the annotated TSV and the URL-mapping file into ner.edith,
you can jump directly to the original image from which the full text
was extracted.
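
When several existing TSV files need the same treatment, the two commands of this section can be wrapped in a small function. This is only a sketch: the name `annotate_and_link` and the override variables `ANNOTATE_TSV` and `EXTRACT_DOC_LINKS` are hypothetical, and the output names simply follow the `-annotated.tsv` and `-urls.tsv` pattern of the example above. It could be used as `for f in *.tsv; do annotate_and_link "$f"; done`.

```shell
# Hypothetical wrapper (not shipped with this package).
# For an input such as enp_DE.tsv it runs the same two commands as above,
# writing enp_DE-annotated.tsv and enp_DE-urls.tsv.
# Dry run: ANNOTATE_TSV=echo EXTRACT_DOC_LINKS=echo annotate_and_link enp_DE.tsv
annotate_and_link() {
    tsv=$1
    stem=${tsv%.tsv}   # strip the .tsv suffix to build the output names
    "${ANNOTATE_TSV:-annotate-tsv}" "$tsv" "$stem-annotated.tsv" || return 1
    "${EXTRACT_DOC_LINKS:-extract-doc-links}" "$tsv" "$stem-urls.tsv" || return 1
}
```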