mirror of
https://github.com/qurator-spk/neat.git
synced 2025-06-09 03:39:59 +02:00
Update Preprocessing.md
parent
4bc3351d37
commit
97d444e7dc
1 changed files with 8 additions and 5 deletions
@@ -20,7 +20,7 @@ OCR is based on [OCR-D](https://github.com/OCR-D)'s [ocrd_tesserocr](https://git
``OUTPUT``: [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) file with bounding boxes for words and the contained text
-### Tokenization
+### TSV Transformation
A simple [Python tool](https://github.com/qurator-spk/neath/tree/master/tools) is used for the [transformation](https://github.com/qurator-spk/neath/tree/master/tools) of [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) to [TSV](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#data-format).
@@ -28,10 +28,13 @@ A simple [Python tool](https://github.com/qurator-spk/neath/tree/master/tools) i
``OUTPUT``: [TSV](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#data-format) file in the desired format for [neath](https://github.com/qurator-spk/neath)
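As a rough illustration of the PAGE-XML to TSV step, the sketch below reads `Word` elements, takes the bounding box of each `Coords` polygon, and writes one tab-separated row per word. It is not the tool from the repository: the sample XML, the function names, and the column layout (`TOKEN`, `left`, `top`, `right`, `bottom`) are assumptions made for this example. The actual TSV format is the one described in the User Guide.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written PAGE-XML fragment, used only for illustration.
SAMPLE_PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="sample.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1">
          <Coords points="10,20 60,20 60,40 10,40"/>
          <TextEquiv><Unicode>Berlin</Unicode></TextEquiv>
        </Word>
        <Word id="w2">
          <Coords points="70,20 90,20 90,40 70,40"/>
          <TextEquiv><Unicode>ist</Unicode></TextEquiv>
        </Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the XML namespace so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def page_xml_to_rows(xml_text):
    """Return (text, left, top, right, bottom) for every Word element."""
    rows = []
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) != "Word":
            continue
        points, text = None, None
        for child in elem:
            name = local_name(child.tag)
            if name == "Coords":
                points = child.get("points")
            elif name == "TextEquiv":
                for sub in child:
                    if local_name(sub.tag) == "Unicode":
                        text = sub.text or ""
        if points is None or text is None:
            continue  # skip words without coordinates or recognized text
        xy = [tuple(int(v) for v in p.split(",")) for p in points.split()]
        xs = [x for x, _ in xy]
        ys = [y for _, y in xy]
        rows.append((text, min(xs), min(ys), max(xs), max(ys)))
    return rows


def rows_to_tsv(rows):
    """Serialize rows with a hypothetical header; not the official neat columns."""
    lines = ["\t".join(["TOKEN", "left", "top", "right", "bottom"])]
    lines += ["\t".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(rows_to_tsv(page_xml_to_rows(SAMPLE_PAGE_XML)), end="")
```

Matching on local element names rather than a fixed namespace is a deliberate choice here, since PAGE-XML files may declare different schema versions.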
-Some postprocessing is then applied to the derived [TSV](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#data-format) file:
+### Tokenization
-* replace ``„`` and ``“`` with ``"``
-* detect sentence boundaries and mark them with a newline and leading ``0``
-* detect punctuation (``.``, ``,``, ``;``, ``:``, ``!``, ``?``) and split it from the adjacent string
+For tokenization, [SoMaJo](https://github.com/tsproisl/SoMaJo) is used.
+``INPUT ``: [TSV](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#data-format) file in the desired format for [neath](https://github.com/qurator-spk/neath)
+``OUTPUT``: [TSV](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#data-format) file with tokenization
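For comparison, the three hand-written postprocessing rules that this commit drops in favour of SoMaJo (quote normalization, sentence boundary marking, punctuation splitting) could be sketched roughly as follows. This is a reconstruction for illustration only, not the removed code. The function names are hypothetical, and reading "a newline and leading ``0``" as "blank row, then token numbering restarts at 0" is an assumption.

```python
import re

# Map the German-style quotes named in the rule list to a plain double quote.
QUOTE_TABLE = str.maketrans({"„": '"', "“": '"'})

# Punctuation marks named in the rule list; each one becomes its own token.
PUNCT_OR_TEXT = re.compile(r"[.,;:!?]|[^.,;:!?]+")

# Assumption: these marks are treated as sentence-final.
SENTENCE_END = {".", "!", "?"}


def split_punctuation(word):
    """Split every listed punctuation mark off the adjacent string (naive)."""
    return PUNCT_OR_TEXT.findall(word)


def postprocess(words):
    """Return (position, token) rows; None marks the blank line between sentences."""
    rows = []
    position = 0
    for word in words:
        for token in split_punctuation(word.translate(QUOTE_TABLE)):
            rows.append((position, token))
            position += 1
            if token in SENTENCE_END:
                rows.append(None)  # sentence boundary: blank row
                position = 0       # numbering restarts with a leading 0
    return rows


if __name__ == "__main__":
    for row in postprocess(["„Hallo“", "Welt.", "Gut!"]):
        print("" if row is None else f"{row[0]}\t{row[1]}")
```

Rules this simple also split abbreviations and decimal numbers such as `3.5` at every period, which is the kind of case a dedicated tokenizer like SoMaJo is designed to handle.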
### Named Entity Recognition