mirror of https://github.com/qurator-spk/neat.git
# Preprocessing

The preprocessing pipeline developed at the
[Berlin State Library](http://staatsbibliothek-berlin.de/)
comprises the following steps:

- Textline extraction @[sbb_pixelwise_segmentation](https://github.com/qurator-spk/pixelwise_segmentation_SBB)
- Word segmentation @[ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr)
- OCR @[ocrd_calamari](https://github.com/qurator-spk/ocrd_calamari)
- Tokenization
- Pretagging @[sbb_ner](https://github.com/qurator-spk/sbb_ner)
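
The order of the steps above can be sketched as a simple chain in which each stage consumes the output of the previous one. This is only an illustration of the data flow, not the actual integration code: `PIPELINE_STEPS`, `run_pipeline`, and `make_placeholder` are hypothetical names, and the placeholder stages merely record that a step ran instead of calling the listed tools.

```python
from typing import Callable, List, Optional, Tuple

# Step names and tool names are taken from the list above.
# A tool of None means no tool is named for that step.
PIPELINE_STEPS: List[Tuple[str, Optional[str]]] = [
    ("textline extraction", "sbb_pixelwise_segmentation"),
    ("word segmentation", "ocrd_tesserocr"),
    ("OCR", "ocrd_calamari"),
    ("tokenization", None),
    ("pretagging", "sbb_ner"),
]

Stage = Callable[[List[str]], List[str]]


def run_pipeline(data: List[str], stages: List[Tuple[str, Stage]]) -> List[str]:
    """Feed the output of each stage into the next one, in order."""
    for _name, stage in stages:
        data = stage(data)
    return data


def make_placeholder(step: str) -> Stage:
    """Hypothetical stand-in for a real tool: appends the step name to a trace."""
    def stage(trace: List[str]) -> List[str]:
        return trace + [step]
    return stage


# Build one placeholder stage per step and run them in the listed order.
stages = [(step, make_placeholder(step)) for step, _tool in PIPELINE_STEPS]
trace = run_pipeline([], stages)
print(" -> ".join(trace))
```

In a real setup, each placeholder would be replaced by a call into the corresponding tool, following that tool's own documentation.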