Merge branch 'master' of https://github.com/qurator-spk/neath

2026-07-29 06:32:29 +02:00 · 2019-11-20 18:34:59 +01:00
commit cf83a53c09
parent f13f439826 6a49815db3
1 changed file with 5 additions and 1 deletion
--- a/docs/Preprocessing.md
+++ b/docs/Preprocessing.md
@@ -14,7 +14,11 @@ OCR is based on [OCR-D](https://github.com/OCR-D)'s [ocrd_tesserocr](https://git

 ### Tokenization

-[Transformation](https://github.com/qurator-spk/neath/tree/master/tools) of [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) to [TSV](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#data-format).
+* [Transformation](https://github.com/qurator-spk/neath/tree/master/tools) of [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) to [TSV](https://github.com/qurator-spk/neath/blob/master/docs/User_Guide.md#data-format).
+* Postprocessing:
+  * replace ``„`` and ``“`` with ``"``
+  * sentence boundaries
+  * punctuation

 ### Named Entity Recognition
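
The quote-replacement step listed under "Postprocessing" could look roughly like the sketch below. It is an illustration only and is not taken from the neath repository: the names `QUOTE_MAP` and `normalize_quotes` are hypothetical, and the sketch assumes tokens arrive as plain Python strings. Sentence-boundary and punctuation handling are not covered, since the change does not describe them further.

```python
# Hypothetical helper, not part of the neath code base.
# Maps the two quotation marks named in the diff (U+201E and U+201C)
# to a plain ASCII double quote.
QUOTE_MAP = str.maketrans({"\u201e": '"', "\u201c": '"'})


def normalize_quotes(token: str) -> str:
    """Return the token with „ and “ replaced by a plain double quote."""
    return token.translate(QUOTE_MAP)


if __name__ == "__main__":
    print(normalize_quotes("\u201eBerlin\u201c"))
```

`str.translate` with a prebuilt table handles both characters in one pass and leaves all other characters untouched.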