mirror of
https://github.com/qurator-spk/sbb_textline_detection.git
synced 2025-06-14 14:20:03 +02:00
Update README.md
parent
8872131a43
commit
601ae8bff7
1 changed files with 6 additions and 0 deletions
@@ -9,6 +9,12 @@ The goal of this project is to extract textlines of a document to feed an ocr mo
* Layout analysis
* Textline detection
* Heuristic methods

The first three stages are performed by pixel-wise segmentation. You can train your own model with this tool: https://github.com/qurator-spk/sbb_pixelwise_segmentation.

## Printspace or border extraction
From an OCR point of view, text outside the printspace region should be excluded, so the printspace region has to be detected and extracted first. As mentioned briefly above, this is done by binary pixel-wise segmentation. We trained our model on a dataset of 2000 documents, about 1200 of which came from the dhSegment project (the dataset can be downloaded from https://github.com/dhlab-epfl/dhSegment/releases/download/v0.2/pages.zip); the rest were annotated by myself using our own dataset at SBB.

Note that for page (printspace or border) extraction, the whole image must be fed to the model at once, not in patches.
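As a rough illustration of the whole-image approach, the sketch below scales a page down to a fixed model input size, predicts a binary mask in a single pass, scales the mask back up, and crops the page to the mask's bounding box. It is not the code used in this repository: `resize_nearest`, `extract_printspace`, the `predict_mask` callable (a stand-in for the trained model) and the default `model_size` of 448×448 are all assumptions made for the example.

```python
import numpy as np


def resize_nearest(image, height, width):
    """Nearest-neighbour resize using plain NumPy indexing."""
    rows = np.arange(height) * image.shape[0] // height
    cols = np.arange(width) * image.shape[1] // width
    return image[rows][:, cols]


def extract_printspace(image, predict_mask, model_size=(448, 448)):
    """Predict a binary page mask on the whole image and crop to it.

    predict_mask is a hypothetical callable standing in for the trained
    model: it receives an array of shape model_size + (channels,) and
    returns a 2-D array of 0/1 labels of shape model_size.
    model_size is an assumed input size, not a value from this project.
    """
    full_h, full_w = image.shape[:2]

    # Whole image in one pass: scale the page down to the model input
    # size instead of cutting it into patches.
    small = resize_nearest(image, *model_size)
    mask_small = predict_mask(small)

    # Bring the mask back to the original resolution.
    mask = resize_nearest(mask_small, full_h, full_w)

    # Bounding box of all pixels labelled as printspace.
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0 or cols.size == 0:
        # Nothing detected: keep the full page.
        return image, (0, 0, full_w, full_h)

    top, bottom = int(rows[0]), int(rows[-1]) + 1
    left, right = int(cols[0]), int(cols[-1]) + 1
    return image[top:bottom, left:right], (left, top, right, bottom)
```

Any function that maps the resized page to a 0/1 mask can be passed as `predict_mask`, for example a thin wrapper around a trained segmentation model's prediction call. The returned box is `(left, top, right, bottom)` in original image coordinates.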
## Installation
`pip install .`