From 860a3c45f017154c4775610e87fc0e93835360a3 Mon Sep 17 00:00:00 2001
From: cneud
Date: Tue, 19 Nov 2019 23:32:29 +0100
Subject: [PATCH] Create Preprocessing.md

---
 docs/Preprocessing.md | 10 ++++++++++
 1 file changed, 10 insertions(+)
 create mode 100644 docs/Preprocessing.md

diff --git a/docs/Preprocessing.md b/docs/Preprocessing.md
new file mode 100644
index 0000000..8a7dc9c
--- /dev/null
+++ b/docs/Preprocessing.md
@@ -0,0 +1,10 @@
+# Preprocessing
+
+The preprocessing pipeline that is developed at the
+[Berlin State Library](http://staatsbibliothek-berlin.de/)
+comprises the following steps:
+- textline extraction @[sbb_pixelwise_segmentation](https://github.com/qurator-spk/pixelwise_segmentation_SBB)
+- word segmentation @[ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr)
+- OCR @[ocrd_calamari](https://github.com/qurator-spk/ocrd_calamari)
+- Tokenization
+- Pretagging @[sbb_ner](https://github.com/qurator-spk/sbb_ner)
\ No newline at end of file