From 6a49815db32a8e85159c57800ead15bb3c7de6be Mon Sep 17 00:00:00 2001
From: Clemens Neudecker
Date: Wed, 20 Nov 2019 18:17:33 +0100
Subject: [PATCH] Update Preprocessing.md

---
 docs/Preprocessing.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/docs/Preprocessing.md b/docs/Preprocessing.md
index f249996..09cab15 100644
--- a/docs/Preprocessing.md
+++ b/docs/Preprocessing.md
@@ -14,7 +14,11 @@ OCR is based on [OCR-D](https://github.com/OCR-D)'s [ocrd_tesserocr](https://git
 
 ### Tokenization
 
-[Transformation](https://github.com/qurator-spk/neath/tree/master/tools) of [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) to [TSV](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#data-format).
+* [Transformation](https://github.com/qurator-spk/neath/tree/master/tools) of [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) to [TSV](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#data-format).
+* Postprocessing:
+  * replace ``„`` and ``“`` with ``"``
+  * sentence boundaries
+  * punctuation
 
 ### Named Entity Recognition
 
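
The patch lists the postprocessing steps without detail. A minimal sketch of the first one (quote normalization) might look like the following, assuming the text is handled as plain Python strings. The name `normalize_quotes` is hypothetical and not taken from the repository. The sentence-boundary and punctuation steps are left out because the patch does not say how they are handled.

```python
# Hypothetical helper for illustration; not part of the qurator-spk tooling.
# Maps the German-style quotes named in the patch to an ASCII double quote.
QUOTE_MAP = str.maketrans({
    "\u201e": '"',  # „ DOUBLE LOW-9 QUOTATION MARK
    "\u201c": '"',  # “ LEFT DOUBLE QUOTATION MARK
})


def normalize_quotes(text: str) -> str:
    """Return text with „ and “ replaced by a plain double quote."""
    return text.translate(QUOTE_MAP)


if __name__ == "__main__":
    print(normalize_quotes("Er sagte: \u201eGuten Tag\u201c."))
```

A translation table keeps the replacement to a single pass over the string and makes it easy to add further characters later if that turns out to be needed.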