# Textline Detection
> Detect textlines in document images

## Introduction
This tool performs printspace, region and textline detection on document images
and returns the results as [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML).
The goal of this project is to extract the textlines of a document so that they can be fed to an OCR model. This is achieved in four successive stages:
* Printspace or border extraction
* Layout analysis
* Textline detection
* Heuristic methods
<br/>
The first three stages use pixel-wise segmentation. You can train your own model using this tool: https://github.com/qurator-spk/sbb_pixelwise_segmentation
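
Because the results are returned as PAGE-XML, the detected textlines can be read back with standard XML tooling. The following is a minimal sketch, not part of this tool: the sample document is hand-written for illustration, and `parse_points` and `read_textlines` are hypothetical helper names. It assumes the output contains `TextLine` elements with a `Coords` child carrying a `points` attribute, as defined by the PAGE schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PAGE-XML snippet for illustration only;
# real output from the tool will contain more elements and attributes.
SAMPLE_PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,400 100,400"/>
      <TextLine id="r1_l1">
        <Coords points="110,110 890,110 890,160 110,160"/>
      </TextLine>
      <TextLine id="r1_l2">
        <Coords points="110,170 890,170 890,220 110,220"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn a PAGE 'points' string ("x1,y1 x2,y2 ...") into (x, y) int tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_textlines(xml_text):
    """Return {line_id: polygon} for every TextLine in the document."""
    root = ET.fromstring(xml_text.strip())
    lines = {}
    # "{*}" matches any namespace (Python 3.8+), so the PAGE schema
    # version used by the file does not need to be known in advance.
    for line in root.iterfind(".//{*}TextLine"):
        coords = line.find("{*}Coords")
        if coords is not None:
            lines[line.get("id")] = parse_points(coords.get("points"))
    return lines


if __name__ == "__main__":
    for line_id, polygon in read_textlines(SAMPLE_PAGE_XML).items():
        print(line_id, polygon)
```

Matching on the namespace wildcard keeps the sketch independent of the PAGE schema version written by the tool.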

## Printspace or border extraction
From an OCR point of view, text outside the printspace region should be avoided, so the printspace region has to be detected and extracted first. As mentioned above, this is done by binary pixel-wise segmentation. We trained our model on a dataset of 2000 documents, about 1200 of which come from the dhSegment project (you can download that dataset from https://github.com/dhlab-epfl/dhSegment/releases/download/v0.2/pages.zip); I annotated the rest myself using our dataset at SBB.
Note that for page (printspace or border) extraction, the model has to be fed the whole image at once and not in patches.
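
To illustrate the difference between the two input modes, here is a small NumPy sketch. It is not the tool's implementation: the input sizes, the nearest-neighbour resizing and the function names are assumptions chosen for the example. Whole-image input gives the model a single downscaled view of the full page, while patch input splits the page into tiles that each see only a local part of it.

```python
import numpy as np


def resize_nearest(image, height, width):
    """Nearest-neighbour resize using index arithmetic (illustrative only)."""
    rows = np.arange(height) * image.shape[0] // height
    cols = np.arange(width) * image.shape[1] // width
    return image[rows][:, cols]


def whole_image_input(image, model_height, model_width):
    """One model input covering the full page, with a leading batch axis."""
    return resize_nearest(image, model_height, model_width)[np.newaxis]


def patch_inputs(image, patch_height, patch_width):
    """Non-overlapping tiles; each tile only shows a local part of the page."""
    tiles = []
    for top in range(0, image.shape[0] - patch_height + 1, patch_height):
        for left in range(0, image.shape[1] - patch_width + 1, patch_width):
            tiles.append(image[top:top + patch_height, left:left + patch_width])
    return np.stack(tiles)


if __name__ == "__main__":
    page = np.zeros((1200, 800, 3), dtype=np.uint8)  # dummy RGB page
    print(whole_image_input(page, 448, 448).shape)   # single full-page input
    print(patch_inputs(page, 400, 400).shape)        # several local tiles
```

A page border is a global property of the image, which is one plausible reason why a single full-page input suits this stage better than local tiles.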


## Installation
`pip install .`

### Models
In order to run this tool you also need trained models. You can download our pretrained models from here:   
https://qurator-data.de/sbb_textline_detector/

## Usage
`sbb_textline_detector -i <image file name> -o <directory to write output xml> -m <directory of models>`
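
The command processes one image per call. To process a whole directory, a thin wrapper can call it once per file. The sketch below uses only the `-i`, `-o` and `-m` options shown above; the helper names, the `*.tif` pattern and the directory names are assumptions made for the example.

```python
import subprocess
from pathlib import Path


def build_command(image, out_dir, model_dir):
    """Build the argument list for one sbb_textline_detector call."""
    return ["sbb_textline_detector",
            "-i", str(image), "-o", str(out_dir), "-m", str(model_dir)]


def run_batch(image_dir, out_dir, model_dir, pattern="*.tif", dry_run=False):
    """Run the detector on every matching image; return the commands used."""
    images = sorted(Path(image_dir).glob(pattern))
    commands = [build_command(p, out_dir, model_dir) for p in images]
    if not dry_run:
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        for cmd in commands:
            # check=True stops the batch at the first failing image.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical paths; dry_run=True only prints the planned commands.
    for cmd in run_batch("images", "output", "models", dry_run=True):
        print(" ".join(cmd))
```

Running with `dry_run=True` first makes it easy to check the planned calls before any models are loaded.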

### Usage with OCR-D
~~~
ocrd-example-binarize -I OCR-D-IMG -O OCR-D-IMG-BIN
ocrd-sbb-textline-detector -I OCR-D-IMG-BIN -O OCR-D-SEG-LINE-SBB \
        -p '{ "model": "/path/to/the/models/textline_detection" }'
~~~

Segmentation works on raw RGB images, but retains
`AlternativeImage`s from binarization steps, so it's OK to do
binarization first, then perform the textline detection. The binarization
processor used must produce an `AlternativeImage` for the binarized image
rather than replacing the original raw RGB image.