From 601ae8bff75995b1661a65b3365b26e52cab46d1 Mon Sep 17 00:00:00 2001
From: vahidrezanezhad
Date: Mon, 3 Aug 2020 13:04:29 +0200
Subject: [PATCH] Update README.md

---
 README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/README.md b/README.md
index 130bd3f..46aed13 100644
--- a/README.md
+++ b/README.md
@@ -9,6 +9,12 @@ The goal of this project is to extract textlines of a document to feed an ocr mo
 * Layout analysis
 * Textline detection
 * Heuristic methods
+The first three stages are performed with pixel-wise segmentation. You can train your own model using this tool (https://github.com/qurator-spk/sbb_pixelwise_segmentation).
+
+## Printspace or border extraction
+From an OCR point of view, text outside the printspace region should be excluded, so the printspace region has to be detected and extracted. As mentioned briefly above, this is done with binary pixel-wise segmentation. We trained our model on a dataset of 2000 documents, of which about 1200 came from the dhSegment project (you can download the dataset here: https://github.com/dhlab-epfl/dhSegment/releases/download/v0.2/pages.zip); the rest I annotated myself using our dataset at SBB.
+It is worth mentioning that for page (printspace or border) extraction you have to feed the model the whole image at once, not in patches.
+
 ## Installation
 `pip install .`