Preprocessing
The preprocessing pipeline developed at the Berlin State Library comprises the following steps:
Layout Analysis & Textline Extraction
Layout analysis and textline extraction are performed with sbb_textline_detector.
INPUT
: image file
OUTPUT
: PAGE-XML file with bounding boxes for regions and text lines
OCR & Word Segmentation
OCR is based on OCR-D's ocrd_tesserocr, which requires Tesseract >= 4.1.0. The GT4HistOCR_2000000 model, trained on the GT4HistOCR corpus, is used. Further details are available in the paper.
INPUT
: PAGE-XML file with bounding boxes for regions and text lines
OUTPUT
: PAGE-XML file with bounding boxes for words and the contained text
TSV Transformation
A simple Python tool is used for the transformation of PAGE-XML to TSV.
INPUT
: PAGE-XML file with bounding boxes for words and the contained text
OUTPUT
: TSV file in the desired format for neath
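The transformation above can be illustrated with a minimal sketch using only the Python standard library. It is not the tool used in the pipeline: the function name `page_words_to_tsv` and the column layout (`TOKEN`, `left`, `top`, `right`, `bottom`) are assumptions chosen for illustration, and the format actually expected by neath may differ. The sketch reads each `Word` element of a PAGE-XML document, takes its text from `TextEquiv/Unicode`, and reduces the polygon in `Coords/@points` to a bounding box.

```python
import xml.etree.ElementTree as ET


def page_words_to_tsv(page_xml: str) -> str:
    """Convert the Word elements of a PAGE-XML string into TSV text.

    The column layout is hypothetical and only serves as an illustration.
    """
    root = ET.fromstring(page_xml)
    # Take the namespace from the root tag so any PAGE schema version works.
    ns = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""
    prefix = f"{{{ns}}}" if ns else ""

    rows = ["TOKEN\tleft\ttop\tright\tbottom"]
    for word in root.iter(f"{prefix}Word"):
        coords = word.find(f"{prefix}Coords")
        unicode_el = word.find(f"{prefix}TextEquiv/{prefix}Unicode")
        if coords is None or unicode_el is None or not unicode_el.text:
            continue  # skip words without coordinates or recognized text
        # "points" is a space-separated list of "x,y" pairs (integer pixels assumed).
        points = [
            tuple(int(v) for v in pair.split(","))
            for pair in coords.get("points", "").split()
        ]
        if not points:
            continue
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        text = unicode_el.text.replace("\t", " ")  # keep the TSV well-formed
        rows.append(f"{text}\t{min(xs)}\t{min(ys)}\t{max(xs)}\t{max(ys)}")
    return "\n".join(rows) + "\n"


# A tiny hand-written PAGE-XML document for demonstration purposes.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="90,90 400,90 400,150 90,150"/>
      <TextLine id="l1">
        <Coords points="100,100 390,100 390,140 100,140"/>
        <Word id="w1">
          <Coords points="100,100 180,100 180,140 100,140"/>
          <TextEquiv><Unicode>Berlin</Unicode></TextEquiv>
        </Word>
        <Word id="w2">
          <Coords points="200,102 260,102 260,138 200,138"/>
          <TextEquiv><Unicode>1848</Unicode></TextEquiv>
        </Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    print(page_words_to_tsv(SAMPLE), end="")
```

Reading the namespace from the root element rather than hard-coding it keeps the sketch independent of the PAGE schema version written by the OCR step.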
Tokenization
For tokenization, SoMaJo is used.
INPUT
: TSV file in the desired format for neath
OUTPUT
: TSV file with tokenization
Named Entity Recognition
For Named Entity Recognition, a BERT-Base model was trained for noisy OCR texts with historical spelling variation. sbb_ner combines unsupervised training on a large corpus of German OCR (~2.3 million pages) with supervised training on a small annotated corpus (47k tokens). Further details are available in the paper.
INPUT
: TSV file obtained after Tokenization and postprocessing
OUTPUT
: TSV file with automatically recognized named entities added
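As a rough illustration of what "named entities added" can look like in a token-per-row TSV, the sketch below appends one tag per token as an extra column and groups the tags back into entity spans. It does not call sbb_ner. The column name `NE-TAG`, the BIO tagging scheme (`B-PER`, `I-PER`, `O`, ...) and both helper functions are assumptions made for this example, and the tags are supplied by hand rather than predicted by a model.

```python
def add_ne_column(tsv_text, tags, column="NE-TAG"):
    """Append one tag per token row as a new TSV column (hypothetical layout)."""
    lines = tsv_text.rstrip("\n").split("\n")
    header, rows = lines[0], lines[1:]
    if len(rows) != len(tags):
        raise ValueError("expected exactly one tag per token row")
    out = [f"{header}\t{column}"]
    out += [f"{row}\t{tag}" for row, tag in zip(rows, tags)]
    return "\n".join(out) + "\n"


def bio_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) entity spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            current = [tag[2:], [token]]
            spans.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An I- tag continues the open entity of the same type.
            current[1].append(token)
        else:
            # "O" or an I- tag without a matching open entity closes the span.
            current = None
    return [(label, " ".join(words)) for label, words in spans]


# Hand-written example data; the tags stand in for model predictions.
TOKENS = ["Alexander", "von", "Humboldt", "reiste", "nach", "Paris", "."]
TAGS = ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "O"]
TSV = "TOKEN\n" + "\n".join(TOKENS) + "\n"

if __name__ == "__main__":
    print(add_ne_column(TSV, TAGS), end="")
    print(bio_spans(TOKENS, TAGS))
```

Keeping one tag per token row means the bounding-box columns produced earlier in the pipeline stay aligned with the entity annotations.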