From 9f2f3ba2d56e62cdcb74f5b3ea09d150579cddb6 Mon Sep 17 00:00:00 2001
From: Clemens Neudecker
Date: Wed, 20 Nov 2019 00:59:01 +0100
Subject: [PATCH] Update Preprocessing.md

---
 docs/Preprocessing.md | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/docs/Preprocessing.md b/docs/Preprocessing.md
index bb91fe6..b5e4b1f 100644
--- a/docs/Preprocessing.md
+++ b/docs/Preprocessing.md
@@ -10,12 +10,10 @@ Layout Analysis & Textline Extraction @[sbb_pixelwise_segmentation](https://gith
 
 ### OCR & Word Segmentation
 
-OCR is based on [OCR-D](https://github.com/OCR-D)'s [ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr) which requires [Tesseract](https://github.com/tesseract-ocr/tesseract) **>= 4.1.0**. The [Fraktur_5000000](https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/) model, which is trained on [GT4HistOCR](https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR) is used.
+OCR is based on [OCR-D](https://github.com/OCR-D)'s [ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr), which requires [Tesseract](https://github.com/tesseract-ocr/tesseract) **>= 4.1.0**. The [GT4HistOCR_2000000](https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/GT4HistOCR_2000000.traineddata) model, which is [trained](https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR) on the [GT4HistOCR](https://zenodo.org/record/1344132) corpus, is used. Further details are available in the [paper](https://arxiv.org/abs/1809.05501).
 
 ### Tokenization
 
 ### Named Entity Recognition
 
-For Named Entity Recognition, a [BERT-Base](https://github.com/google-research/bert) model was trained for noisy OCR texts with historical spelling variation.
-
-[sbb_ner](https://github.com/qurator-spk/sbb_ner) is using a combination of unsupervised training on a large (~2.3m pages) [corpus of German OCR](https://zenodo.org/record/3257041) from the Digital Collections of the Berlin State Library in combination with supervised training on a small (47k tokens) [annotated corpus](https://github.com/EuropeanaNewspapers/ner-corpora/tree/master/enp_DE.sbb.bio) of OCR from digitized historical newspapers of the Berlin State Library. Further details are available in the [paper](https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/KONVENS2019_paper_4.pdf).
+For Named Entity Recognition, a [BERT-Base](https://github.com/google-research/bert) model was trained for noisy OCR texts with historical spelling variation. [sbb_ner](https://github.com/qurator-spk/sbb_ner) uses a combination of unsupervised training on a large (~2.3m pages) [corpus of German OCR](https://zenodo.org/record/3257041) and supervised training on a small (47k tokens) [annotated corpus](https://github.com/EuropeanaNewspapers/ner-corpora/tree/master/enp_DE.sbb.bio). Further details are available in the [paper](https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/KONVENS2019_paper_4.pdf).
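
For readers who want to try the GT4HistOCR_2000000 model outside the full OCR-D workflow, here is a minimal sketch using the tesserocr Python binding that ocrd_tesserocr builds on. The tessdata path and the input image are illustrative assumptions, not part of the documented setup; the `GT4HistOCR_2000000.traineddata` file linked above would first need to be downloaded into Tesseract's tessdata directory.

```python
# Minimal sketch (not the OCR-D workflow itself): recognizing a page with
# Tesseract >= 4.1.0 and the GT4HistOCR_2000000 model via tesserocr.
from tesserocr import PyTessBaseAPI

# Assumed tessdata location; adjust to wherever the .traineddata was placed.
TESSDATA = "/usr/share/tesseract-ocr/4.00/tessdata"

with PyTessBaseAPI(path=TESSDATA, lang="GT4HistOCR_2000000") as api:
    api.SetImageFile("page.tif")      # hypothetical scanned page image
    print(api.GetUTF8Text())          # recognized text
    print(api.AllWordConfidences())   # per-word confidence scores
```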