# TSV - Processing Tools
## Installation:
Set up a virtual environment:
```
virtualenv --python=python3.6 venv
```
Activate the virtual environment:
```
source venv/bin/activate
```
Upgrade pip:
```
pip install -U pip
```
Install the package together with its dependencies in development mode:
```
pip install -e ./
```
## Usage:
Create a URL-annotated TSV file from an existing TSV file:
```
annotate-tsv enp_DE.tsv enp_DE-annotated.tsv
```
Create a corresponding URL-mapping file:
```
extract-doc-links enp_DE.tsv enp_DE-urls.tsv
```
By loading the annotated TSV as well as the URL-mapping file into
ner.edith, you will be able to jump directly to the original image
from which the full text was extracted.
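Both commands read the same input file, so the two steps can be wrapped in one small helper. This is only a sketch and not part of the package: the function name `annotate_and_map` is hypothetical, and the `-annotated.tsv` / `-urls.tsv` output names are assumed from the example file names above.

```shell
# Hypothetical helper (not shipped with this package): run both steps
# for one input TSV and derive the output names from the input name.
annotate_and_map() {
    input="$1"
    base="${input%.tsv}"    # strip the .tsv extension

    # Only build the URL-mapping file if the annotation step succeeded.
    annotate-tsv "$input" "${base}-annotated.tsv" &&
        extract-doc-links "$input" "${base}-urls.tsv"
}
```

With the virtual environment activated, `annotate_and_map enp_DE.tsv` would run the two commands shown above with the same file names.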
# PAGE-XML to TSV Transformation
## Usage:
Create a TSV file from OCR in PAGE-XML format (with word segmentation):
```
python page2tsv.py PAGE.xml > PAGE.tsv
```
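To convert a whole directory of PAGE-XML files, the command above can be run in a loop. The following is a minimal sketch under stated assumptions: the function name `page2tsv_all` is hypothetical, `page2tsv.py` is assumed to be in the current working directory, and each output file is assumed to be named after its input with the extension swapped to `.tsv`.

```shell
# Hypothetical helper: convert every *.xml file in a directory (default: .).
# Assumes page2tsv.py is in the current working directory.
page2tsv_all() {
    dir="${1:-.}"
    for xml in "$dir"/*.xml; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$xml" ] || continue
        python page2tsv.py "$xml" > "${xml%.xml}.tsv"
    done
}
```

For example, `page2tsv_all pages/` would write `pages/NAME.tsv` next to each `pages/NAME.xml`.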