# Textline Detection
> Detect textlines in document images
## Introduction
This tool performs printspace, region and textline detection on document image
data and returns the results as [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML).
The goal of this project is to extract the textlines of a document so that they can be fed to an OCR model. This is achieved in four successive stages:
* Printspace or border extraction
* Layout analysis
* Textline detection
* Heuristic methods
<br/>
The first three stages use pixel-wise segmentation. You can train your own model using this tool: https://github.com/qurator-spk/sbb_pixelwise_segmentation.
## Printspace or border extraction
From an OCR point of view, text outside the printspace region should be excluded, so the printspace region has to be detected and extracted first. As mentioned above, this is done by binary pixel-wise segmentation. We trained our model on a dataset of 2000 documents, of which about 1200 came from the dhSegment project (you can download that dataset from https://github.com/dhlab-epfl/dhSegment/releases/download/v0.2/pages.zip); the rest were annotated by myself using our own dataset at SBB.
Note that for page (printspace or border) extraction you have to feed the model the whole image at once, not in patches.
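
The difference between the two feeding strategies can be sketched in a few lines of Python. This is only an illustration, not the tool's own code: the function names and the 448 pixel patch size in the usage note are assumptions. Patch-wise feeding yields several tiles, none of which sees the complete page border, whereas whole-image feeding yields a single region covering the full page.

```python
def patch_boxes(height, width, patch_h, patch_w):
    """Patch-wise feeding: return (x0, y0, x1, y1) boxes that tile the image.

    Boxes on the right and bottom edges are clipped to the image size.
    """
    boxes = []
    for y0 in range(0, height, patch_h):
        for x0 in range(0, width, patch_w):
            x1 = min(x0 + patch_w, width)
            y1 = min(y0 + patch_h, height)
            boxes.append((x0, y0, x1, y1))
    return boxes


def whole_image_box(height, width):
    """Whole-image feeding: a single box that covers the entire page."""
    return [(0, 0, width, height)]
```

For a hypothetical page of 1000 x 700 pixels, `patch_boxes(1000, 700, 448, 448)` returns six tiles, while `whole_image_box(1000, 700)` returns the one region that page extraction needs.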
## Installation
`pip install .`
### Models
In order to run this tool you also need trained models. You can download our pretrained models from here:
https://qurator-data.de/sbb_textline_detector/
## Usage
`sbb_textline_detector -i <image file name> -o <directory to write output xml> -m <directory of models>`
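
To process many images, the command can be wrapped in a small script. The following is a minimal sketch that uses only the `-i`, `-o` and `-m` options shown above; the helper names `build_command` and `run_on_directory` and the default `*.tif` pattern are assumptions for illustration.

```python
import subprocess
from pathlib import Path


def build_command(image, out_dir, model_dir):
    """Assemble one CLI call from the -i, -o and -m options."""
    return [
        "sbb_textline_detector",
        "-i", str(image),
        "-o", str(out_dir),
        "-m", str(model_dir),
    ]


def run_on_directory(image_dir, out_dir, model_dir, pattern="*.tif"):
    """Run the detector once per matching image; stop at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for image in sorted(Path(image_dir).glob(pattern)):
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(build_command(image, out_dir, model_dir), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.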
### Usage with OCR-D
~~~
ocrd-example-binarize -I OCR-D-IMG -O OCR-D-IMG-BIN
ocrd-sbb-textline-detector -I OCR-D-IMG-BIN -O OCR-D-SEG-LINE-SBB \
-p '{ "model": "/path/to/the/models/textline_detection" }'
~~~
Segmentation works on raw RGB images, but retains
`AlternativeImage`s from binarization steps, so it's OK to do
binarization first, then perform the textline detection. The binarization
processor used must produce an `AlternativeImage` for the binarized image, not
replace the original raw RGB image.
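
One way to check that a binarization step behaved this way is to inspect the PAGE-XML it wrote. The sketch below is an assumption-laden illustration, not part of this tool: the function name `image_references` is hypothetical, and it relies only on the `imageFilename` attribute of `Page` and the `filename` attribute of `AlternativeImage` from the PAGE schema.

```python
import xml.etree.ElementTree as ET


def image_references(page_xml):
    """Return the original image and all AlternativeImage files of a page.

    The "{*}" wildcard matches any namespace, so the PAGE schema version
    does not need to be known (requires Python 3.8 or later).
    """
    root = ET.fromstring(page_xml)
    page = root.find(".//{*}Page")
    alternatives = [
        el.get("filename") for el in root.findall(".//{*}AlternativeImage")
    ]
    return {"original": page.get("imageFilename"), "alternatives": alternatives}
```

If `original` still points at the raw RGB image and `alternatives` lists the binarized file, the original image has been retained alongside the binarized one.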