
Preprocessing

The preprocessing pipeline developed at the Berlin State Library comprises the following steps:

Layout Analysis & Textline Extraction

Layout analysis and text line extraction are performed with sbb_pixelwise_segmentation.

INPUT : image file

OUTPUT: PAGE-XML file with bounding boxes for regions and text lines
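A PAGE-XML file of this kind might look roughly as follows (element names follow the PAGE schema; the ids, dimensions, and coordinates are made up for illustration):

```xml
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.tif" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <Coords points="100,100 1900,100 1900,600 100,600"/>
      <TextLine id="r1_l1">
        <Coords points="110,120 1890,120 1890,180 110,180"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
```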

OCR & Word Segmentation

OCR is based on OCR-D's ocrd_tesserocr, which requires Tesseract >= 4.1.0. The GT4HistOCR_2000000 model, trained on the GT4HistOCR corpus, is used. Further details are available in the paper.

INPUT : PAGE-XML file with bounding boxes for regions and text lines

OUTPUT: PAGE-XML file with bounding boxes for words and the contained text
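The word-level PAGE-XML produced here can be read with the standard library; this is a minimal sketch, where the tiny embedded sample document and its coordinates are made up for illustration (the namespace URL follows the PAGE 2019-07-15 schema):

```python
# Sketch: read word bounding boxes and contained text from word-level PAGE-XML.
import xml.etree.ElementTree as ET

PAGE_NS = "{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}"

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page><TextRegion id="r1"><TextLine id="l1">
    <Word id="w1">
      <Coords points="110,120 200,120 200,180 110,180"/>
      <TextEquiv><Unicode>Berlin</Unicode></TextEquiv>
    </Word>
  </TextLine></TextRegion></Page>
</PcGts>"""

def extract_words(xml_text):
    """Yield (text, points) for every Word element in a PAGE-XML document."""
    root = ET.fromstring(xml_text)
    for word in root.iter(PAGE_NS + "Word"):
        text = word.find(f"{PAGE_NS}TextEquiv/{PAGE_NS}Unicode").text
        points = word.find(PAGE_NS + "Coords").attrib["points"]
        yield text, points

print(list(extract_words(SAMPLE)))
# → [('Berlin', '110,120 200,120 200,180 110,180')]
```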

Tokenization

A simple Python tool converts the PAGE-XML into TSV.

INPUT : PAGE-XML file with bounding boxes for words and the contained text

OUTPUT: TSV file in the desired format for neath
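The conversion can be pictured as flattening each word's polygon into bounding-box columns. The column layout below (TOKEN, left, right, top, bottom) is a simplified assumption; the exact format neath expects is not specified here:

```python
# Sketch: turn (token, "x,y x,y ...") pairs from PAGE-XML into TSV rows.
def points_to_bbox(points):
    """Reduce a PAGE-XML Coords points string to (left, right, top, bottom)."""
    xs, ys = zip(*(map(int, p.split(",")) for p in points.split()))
    return min(xs), max(xs), min(ys), max(ys)

def words_to_tsv(words):
    rows = ["TOKEN\tleft\tright\ttop\tbottom"]
    for token, points in words:
        left, right, top, bottom = points_to_bbox(points)
        rows.append(f"{token}\t{left}\t{right}\t{top}\t{bottom}")
    return "\n".join(rows)

print(words_to_tsv([("Berlin", "110,120 200,120 200,180 110,180")]))
```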

Some postprocessing is then applied to the derived TSV file:

  • replace typographic quotation marks („ and “) with "
  • detect sentence boundaries and mark them with a newline and a leading 0
  • detect punctuation (".", ",", ";", ":", "!", "?") and split it from the adjacent string
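The postprocessing steps above can be sketched on a plain token list. The sentence-boundary heuristic here (a sentence ends at ".", "!", or "?") is an assumption for illustration; the actual tool may use a proper sentence splitter and a different output encoding:

```python
# Sketch of the postprocessing: quote normalization, punctuation splitting,
# and sentence-boundary detection (one token list per detected sentence).
import re

SENT_END = {".", "!", "?"}

def postprocess(tokens):
    sentences, current = [], []
    for token in tokens:
        token = token.replace("\u201e", '"').replace("\u201c", '"')  # „ “ -> "
        for part in re.split(r"([.,;:!?])", token):
            if not part:
                continue
            current.append(part)
            if part in SENT_END:  # heuristic sentence boundary
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    return sentences

print(postprocess(["„Berlin“", "ist", "schön.", "Wirklich!"]))
# → [['"Berlin"', 'ist', 'schön', '.'], ['Wirklich', '!']]
```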

Named Entity Recognition

For Named Entity Recognition, a BERT-Base model was trained on noisy OCR texts with historical spelling variation. sbb_ner combines unsupervised pre-training on a large (~2.3M pages) corpus of German OCR with supervised training on a small (47k tokens) annotated corpus. Further details are available in the paper.

INPUT : TSV file obtained after Tokenization and postprocessing

OUTPUT: TSV file with automatically recognized named entities added
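The effect of this step can be pictured as adding a tag column to the tokenized TSV. A minimal sketch using the conventional BIO tag scheme (B-PER, I-PER, ...); the hard-coded tags stand in for model predictions, and the column order is an assumption rather than the exact sbb_ner/neath layout:

```python
# Sketch: attach per-token BIO named-entity tags to existing TSV rows.
def add_ner_column(tsv_rows, tags):
    header, body = tsv_rows[0], tsv_rows[1:]
    assert len(body) == len(tags), "one tag per token row"
    out = [header + "\tNE-TAG"]
    for row, tag in zip(body, tags):
        out.append(row + "\t" + tag)
    return out

rows = ["TOKEN\tleft\tright\ttop\tbottom",
        "Alexander\t10\t90\t5\t20",
        "von\t95\t120\t5\t20",
        "Humboldt\t125\t210\t5\t20"]
for line in add_ner_column(rows, ["B-PER", "I-PER", "I-PER"]):
    print(line)
```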