Preprocessing
The preprocessing pipeline developed at the Berlin State Library comprises the following steps:
Layout Analysis & Textline Extraction
Layout analysis and textline extraction are provided by sbb_pixelwise_segmentation.
OCR & Word Segmentation
OCR is based on OCR-D's ocrd_tesserocr, which requires Tesseract >= 4.1.0. The Fraktur_5000000 model, which was trained on the GT4HistOCR corpus, is used. The PAGE-XML produced by the Layout Analysis & Textline Extraction step is taken as input, and the output is PAGE-XML containing the text recognition results, with absolute pixel coordinates describing the bounding boxes of words.
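To illustrate what the word-level output looks like to a downstream consumer, here is a minimal sketch that reads `Word` elements from a PAGE-XML document and derives an axis-aligned bounding box for each one. The sample document is hand-written for illustration and is not actual pipeline output; the 2019-07-15 PAGE namespace and the function name `word_boxes` are assumptions, so adjust the namespace to whatever your files declare.

```python
import xml.etree.ElementTree as ET

# Assumption: files use the 2019-07-15 PAGE schema; other versions differ in the date suffix.
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical, hand-written sample; real OCR output contains many more elements.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="90,90 400,90 400,150 90,150"/>
      <TextLine id="r1_l1">
        <Coords points="100,100 390,100 390,140 100,140"/>
        <Word id="r1_l1_w1">
          <Coords points="100,100 180,100 180,140 100,140"/>
          <TextEquiv><Unicode>Berlin</Unicode></TextEquiv>
        </Word>
        <Word id="r1_l1_w2">
          <Coords points="200,102 390,102 390,138 200,138"/>
          <TextEquiv><Unicode>Staatsbibliothek</Unicode></TextEquiv>
        </Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def word_boxes(page_xml):
    """Return (text, (x_min, y_min, x_max, y_max)) for every Word element."""
    root = ET.fromstring(page_xml)
    results = []
    for word in root.iterfind(".//pc:Word", PAGE_NS):
        coords = word.find("pc:Coords", PAGE_NS)
        unicode_el = word.find("pc:TextEquiv/pc:Unicode", PAGE_NS)
        if coords is None or unicode_el is None:
            continue  # skip words without coordinates or recognized text
        # The points attribute is a space-separated list of "x,y" pixel pairs.
        points = [tuple(int(v) for v in pair.split(","))
                  for pair in coords.get("points").split()]
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        results.append((unicode_el.text or "", (min(xs), min(ys), max(xs), max(ys))))
    return results


for text, box in word_boxes(SAMPLE):
    print(text, box)
```

PAGE-XML stores each outline as a polygon, so the sketch reduces the polygon to its enclosing rectangle; keep the full point list if you need the exact shape.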
Tokenization
Named Entity Recognition
For Named Entity Recognition, a BERT-Base model was trained. sbb_ner combines unsupervised training on a large OCR corpus (~2.3 million pages) with supervised training on a small annotated corpus (50k tokens). Further details are available in the paper.
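A token-level tagger of this kind usually needs a small post-processing step that merges per-token labels into entity spans. The sketch below shows one common way to do that, assuming BIO-style tags such as `B-ORG` / `I-ORG` / `O`. The tag scheme, the example sentence, and the function name `group_entities` are assumptions for illustration and are not taken from sbb_ner itself.

```python
def group_entities(tokens, tags):
    """Merge token-level BIO tags into (entity_text, entity_type) spans."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            # An I- tag that does not continue the open entity is treated as a start.
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O" or any other non-entity tag closes the open entity
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


# Hypothetical tagger output for one sentence.
tokens = ["Die", "Staatsbibliothek", "zu", "Berlin", "liegt", "in", "Berlin", "."]
tags = ["O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "B-LOC", "O"]
print(group_entities(tokens, tags))
```

Treating a stray `I-` tag as the start of a new entity is a lenient choice that tolerates tagging errors on noisy OCR text; a stricter decoder could discard such tokens instead.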