From 5ecc96864dc2de5ac2fe1dfe9ee5b6dbcc1a4ad1 Mon Sep 17 00:00:00 2001
From: Clemens Neudecker
Date: Wed, 20 Nov 2019 18:54:01 +0100
Subject: [PATCH] Update Preprocessing.md

---
 docs/Preprocessing.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/Preprocessing.md b/docs/Preprocessing.md
index 03165e2..9a242db 100644
--- a/docs/Preprocessing.md
+++ b/docs/Preprocessing.md
@@ -22,7 +22,7 @@ OCR is based on [OCR-D](https://github.com/OCR-D)'s [ocrd_tesserocr](https://git
 
 ### Tokenization
 
-A simple Python tool is used for the [transformation](https://github.com/qurator-spk/neath/tree/master/tools) of [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) to [TSV](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#data-format).
+A simple [Python tool](https://github.com/qurator-spk/neath/tree/master/tools) is used for the [transformation](https://github.com/qurator-spk/neath/tree/master/tools) of [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) to [TSV](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#data-format).
 
 ``INPUT ``: [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) file with bounding boxes for words and the contained text
 
@@ -30,8 +30,8 @@ A simple Python tool is used for the [transformation](https://github.com/qurator
 
 Some postprocessing is then applied to the derived [TSV](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#data-format) file:
  * replace ``„`` and ``“`` with ``"``
- * detect sentence boundaries and mark them with a leading ``0``
- * detect punctuation and split it from the adjacent string
+ * detect sentence boundaries and mark them with a newline and leading ``0``
+ * detect punctuation (``.``, ``,``, ``;``, ``:``, ``!``, ``?``) and split it from the adjacent string
 
 ### Named Entity Recognition
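
The three postprocessing steps listed in the patched text can be sketched roughly as follows. This is a minimal illustration, not the actual tool from the repository: the function names, the two-column `index<TAB>token` row layout, and the reading of "newline and leading ``0``" as "insert an empty line and restart token numbering at 0" are all assumptions made for the example. Sentence boundaries are detected naively at ``.``, ``!`` and ``?``, so abbreviations and decimal numbers would be split incorrectly.

```python
import re

# Assumption: only the two quote characters named in the docs are normalized.
QUOTES = str.maketrans({"„": '"', "“": '"'})
# Punctuation set taken from the patched documentation line.
PUNCT_PATTERN = re.compile(r"([.,;:!?])")
# Assumption: these three marks end a sentence (naive heuristic).
SENTENCE_END = {".", "!", "?"}


def normalize_quotes(token):
    """Replace „ and “ with a plain double quote."""
    return token.translate(QUOTES)


def split_punctuation(token):
    """Split punctuation marks from the adjacent string, keeping order."""
    return [part for part in PUNCT_PATTERN.split(token) if part]


def postprocess(tokens):
    """Return hypothetical TSV rows ("index<TAB>token") for a token list.

    A new sentence is marked by an empty row followed by a token with index 0.
    """
    rows = []
    index = 0
    for raw in tokens:
        for piece in split_punctuation(normalize_quotes(raw)):
            if index == 0 and rows:
                rows.append("")  # empty line before the next sentence
            rows.append(f"{index}\t{piece}")
            # Restart numbering after sentence-final punctuation.
            index = 0 if piece in SENTENCE_END else index + 1
    return rows


if __name__ == "__main__":
    print("\n".join(postprocess(["„Berlin“", "ist", "groß.", "Ja!"])))
```

Keeping each step in its own function makes it easy to swap the naive sentence heuristic for a proper sentence splitter without touching quote normalization or punctuation splitting.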