
# TSV - Processing Tools

Create .tsv files that can be viewed and edited with [neat](https://github.com/qurator-spk/neat).

## Installation:

Python 3.11 is required.
Consider using [pyenv](https://github.com/pyenv/pyenv) if that Python version is not available on your system.

Activate a virtual environment (virtualenv):
```
source venv/bin/activate
```
or (pyenv):
```
pyenv activate my-python-3.11-virtualenv
```

Update pip:
```
pip install -U pip
```

Install tsvtools:
```
pip install git+https://github.com/qurator-spk/page2tsv.git
```

## PAGE-XML to TSV Transformation:

Create a TSV file from OCR in PAGE-XML format (with word segmentation):

```
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
```

To create a single TSV file from multiple PAGE-XML files, call the tool
repeatedly with the same TSV file:

```
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
page2tsv PAGE2.xml PAGE.tsv --image-url=http://link-to-corresponding-image-2
page2tsv PAGE3.xml PAGE.tsv --image-url=http://link-to-corresponding-image-3
page2tsv PAGE4.xml PAGE.tsv --image-url=http://link-to-corresponding-image-4
page2tsv PAGE5.xml PAGE.tsv --image-url=http://link-to-corresponding-image-5
...
...
...
```
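
The successive calls above can also be scripted. The following is a minimal sketch in Python, assuming `page2tsv` is on the `PATH` (for example, inside the activated virtual environment). The helper names `build_commands` and `run_all` are hypothetical and not part of page2tsv.

```python
import subprocess
from typing import Iterable, List, Tuple


def build_commands(pages: Iterable[Tuple[str, str]], tsv_out: str) -> List[List[str]]:
    """Build one page2tsv call per (PAGE-XML file, image URL) pair.

    Every call targets the same TSV file, mirroring the successive
    calls shown above.
    """
    return [
        ["page2tsv", page_xml, tsv_out, f"--image-url={image_url}"]
        for page_xml, image_url in pages
    ]


def run_all(pages: Iterable[Tuple[str, str]], tsv_out: str) -> None:
    """Run the calls in order and stop at the first one that fails."""
    for command in build_commands(pages, tsv_out):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
```

Calling `run_all([("PAGE1.xml", "http://link-to-corresponding-image-1"), ("PAGE2.xml", "http://link-to-corresponding-image-2")], "PAGE.tsv")` would then issue the first two calls from the listing above, in order.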

For instance, for the file [example.xml](https://github.com/qurator-spk/page2tsv/blob/master/example.xml):

```
page2tsv example.xml example.tsv --image-url=http://content.staatsbibliothek-berlin.de/zefys/SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg
```
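
The `left,top,width,height` part of the image URL above reads as a placeholder for the region of the scan that belongs to a token. As an illustration only, the sketch below assumes that a viewer substitutes pixel coordinates for that literal placeholder before requesting the image. The function name `fill_region` and the sample coordinates are hypothetical and are not taken from page2tsv or neat.

```python
PLACEHOLDER = "left,top,width,height"


def fill_region(url_template: str, left: int, top: int, width: int, height: int) -> str:
    """Replace the literal region placeholder with concrete pixel coordinates."""
    if PLACEHOLDER not in url_template:
        raise ValueError(f"URL does not contain the placeholder {PLACEHOLDER!r}")
    return url_template.replace(PLACEHOLDER, f"{left},{top},{width},{height}")


# The URL template from the example call above; the coordinates are made up.
template = (
    "http://content.staatsbibliothek-berlin.de/zefys/"
    "SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg"
)
print(fill_region(template, 100, 200, 300, 50))
```

A URL without the placeholder raises a `ValueError` here, which makes a mistyped `--image-url` value easy to spot before it is written into a TSV file.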

---

## Processing of already existing TSV files:

Create a URL-annotated TSV file from an existing TSV file:

```
annotate-tsv enp_DE.tsv enp_DE-annotated.tsv
```

## Command-line interface:

```
page2tsv --help
Usage: page2tsv [OPTIONS] PAGE_XML_FILE TSV_OUT_FILE

  Converts a page-XML file into a TSV file that can be edited with neat.
  Optionally the tool also accepts NER and Entitiy Linking API-Endpoints as
  parameters and performs NER and EL and the document if these are provided.

  PAGE_XML_FILE: The source page-XML file. TSV_OUT_FILE: Resulting TSV file.

Options:
  --purpose [NERD|OCR]       Purpose of output tsv file.

                             NERD: NER/NED application/ground-truth creation.

                             OCR: OCR application/ground-truth creation.

                             default: NERD.
  --image-url TEXT           An image retrieval link that enables neat to show
                             the scan images corresponding to the text tokens.
                             Example: https://content.staatsbibliothek-berlin.
                             de/zefys/SNP26824620-18371109-0-1-0-0/left,top,wi
                             dth,height/full/0/default.jpg
  --ner-rest-endpoint TEXT   REST endpoint of sbb_ner service. See
                             https://github.com/qurator-spk/sbb_ner for
                             details. Only applicable in case of NERD.
  --ned-rest-endpoint TEXT   REST endpoint of sbb_ned service. See
                             https://github.com/qurator-spk/sbb_ned for
                             details. Only applicable in case of NERD.
  --noproxy                  disable proxy. default: enabled.
  --scale-factor FLOAT       default: 1.0
  --ned-threshold FLOAT
  --min-confidence FLOAT
  --max-confidence FLOAT
  --ned-priority INTEGER
  --normalization-file PATH
  --help                     Show this message and exit.
```

```
tsv2tsv --help
Usage: tsv2tsv [OPTIONS] TSV_IN_FILE

Options:
  --tsv-out-file PATH          Write modified TSV to this file.
  --ner-rest-endpoint TEXT     REST endpoint of sbb_ner service. See
                               https://github.com/qurator-spk/sbb_ner for
                               details.
  --noproxy                    disable proxy. default: enabled.
  --num-tokens                 Print number of tokens in input/output file.
  --sentence-count             Print sentence count in input/output file.
  --max-sentence-len           Print maximum sentence len for input/output
                               file.
  --keep-tokenization          Keep the word tokenization exactly as it is.
  --sentence-split-only        Do only sentence splitting.
  --show-urls                  Print contained visualization URLs.
  --just-zero                  Process only files that have max sentence
                               length zero,i.e., that do not have sentence
                               splitting.
  --sanitize-sentence-numbers  Sanitize sentence numbering.
  --show-columns               Show TSV columns.
  --drop-column TEXT           Drop column
  --help                       Show this message and exit.
```

```
alto2tsv --help
Usage: alto2tsv [OPTIONS] ALTO_XML_FILE TSV_OUT_FILE

  Converts a ALTO-XML file into a TSV file that can be edited with neat.
  Optionally the tool also accepts NER and Entitiy Linking API-Endpoints as
  parameters and performs NER and EL and the document if these are provided.

  ALTO_XML_FILE: The source ALTO-XML file.
  TSV_OUT_FILE: Resulting TSV file.

Options:
  --purpose [NERD|OCR]      Purpose of output tsv file.

                            NERD: NER/NED application/ground-truth creation.

                            OCR: OCR application/ground-truth creation.

                            default: NERD.
  --image-url TEXT          An image retrieval link that enables neat to show
                            the scan images corresponding to the text tokens.
                            Example: https://content.staatsbibliothek-berlin.d
                            e/zefys/SNP26824620-18371109-0-1-0-0/left,top,widt
                            h,height/full/0/default.jpg
  --ner-rest-endpoint TEXT  REST endpoint of sbb_ner service. See
                            https://github.com/qurator-spk/sbb_ner for
                            details. Only applicable in case of NERD.
  --ned-rest-endpoint TEXT  REST endpoint of sbb_ned service. See
                            https://github.com/qurator-spk/sbb_ned for
                            details. Only applicable in case of NERD.
  --noproxy                 disable proxy. default: enabled.
  --scale-factor FLOAT      default: 1.0
  --ned-threshold FLOAT
  --ned-priority INTEGER
  --help                    Show this message and exit.
```