# ocrd_calamari
> Recognize text using [Calamari OCR](https://github.com/Calamari-OCR/calamari).

[CircleCI](https://circleci.com/gh/OCR-D/ocrd_calamari)
[PyPI](https://pypi.org/project/ocrd_calamari/)
[Codecov](https://codecov.io/gh/OCR-D/ocrd_calamari)
## Introduction
**ocrd_calamari** offers an [OCR-D](https://ocr-d.de)-compliant workspace processor for the functionality of Calamari OCR. It uses OCR-D workspaces (METS) with [PAGE XML](https://github.com/PRImA-Research-Lab/PAGE-XML) documents as input and output.

This processor operates only on the text line level, so it needs a line segmentation (and, by extension, a binarized
image) as its input.

In addition to the line text, it can also output word and glyph segmentation,
including per-glyph confidence values and per-glyph alternative predictions as
provided by the Calamari OCR engine, when `textequiv_level` is set to `word` or
`glyph`. Note that Calamari itself does not provide word segmentation; this
processor infers it from the text segmentation and the glyph positions. The
resulting glyph and word segmentation can be used for text extraction and
highlighting, but is probably not useful for further image-based processing.

## Installation
### From PyPI
```sh
pip install ocrd_calamari
```
### From Repo
```sh
pip install .
```
## Install models
Download models trained on GT4HistOCR data:
```sh
make gt4histocr-calamari1
ls gt4histocr-calamari1
```
## Example Usage
Before using `ocrd-calamari-recognize`, get a model and some example data, and
prepare the document for OCR:
```sh
# Download model and example data
make gt4histocr-calamari1
make actevedef_718448162

# Create binarized images and line segmentation using other OCR-D projects
cd actevedef_718448162
ocrd-olena-binarize -p '{ "impl": "sauvola-ms-split" }' -I OCR-D-IMG -O OCR-D-IMG-BINPAGE,OCR-D-IMG-BIN
ocrd-tesserocr-segment-region -I OCR-D-IMG-BINPAGE -O OCR-D-SEG-REGION
ocrd-tesserocr-segment-line -I OCR-D-SEG-REGION -O OCR-D-SEG-LINE
```
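Before moving on, it can help to check that the preparation steps produced the expected file groups. The following is only a sketch: it assumes that each file group is stored in a directory of the same name inside the workspace, which is a common OCR-D convention but is not stated in this README.

```sh
# Sanity check (a sketch): look for the file group directories created above.
# Assumption: each file group lives in a directory named after it.
MISSING=""
for GRP in OCR-D-IMG-BINPAGE OCR-D-IMG-BIN OCR-D-SEG-REGION OCR-D-SEG-LINE; do
    if [ -d "$GRP" ]; then
        echo "found:   $GRP"
    else
        echo "missing: $GRP"
        MISSING="$MISSING $GRP"
    fi
done
```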
Finally, recognize the text using ocrd_calamari and the downloaded model:
```sh
ocrd-calamari-recognize -p '{ "checkpoint": "../gt4histocr-calamari1/*.ckpt.json" }' -I OCR-D-SEG-LINE -O OCR-D-OCR-CALAMARI
```
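To also get the word or glyph level output described in the introduction, `textequiv_level` can be passed alongside `checkpoint`. The sketch below builds the parameter string in a variable so it can be inspected first. The output file group name `OCR-D-OCR-CALAMARI-GLYPH` is a hypothetical choice, and the guard around the final call is only there so the snippet does nothing harmful when the processor is not installed.

```sh
# Assumption: same workspace and model location as in the command above.
CHECKPOINT='../gt4histocr-calamari1/*.ckpt.json'
LEVEL=glyph   # or "word"

# Build the JSON parameter string once so it can be checked before the run.
PARAMS=$(printf '{ "checkpoint": "%s", "textequiv_level": "%s" }' "$CHECKPOINT" "$LEVEL")
echo "$PARAMS"

# Run the recognizer only if it is available on the PATH.
# The output file group name is a hypothetical example.
if command -v ocrd-calamari-recognize >/dev/null 2>&1; then
    ocrd-calamari-recognize -p "$PARAMS" -I OCR-D-SEG-LINE -O OCR-D-OCR-CALAMARI-GLYPH
fi
```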
You may want to have a look at the [ocrd-tool.json](ocrd_calamari/ocrd-tool.json) descriptions
for additional parameters and default values.
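
As a rough sketch, the parameter descriptions can be printed directly from that file. This assumes the command is run from the repository root, that `jq` is installed, and that the file follows the usual OCR-D layout with a `tools` object keyed by executable name; none of this is stated in this README.

```sh
# Assumption: current directory is the repository root.
TOOL_JSON=ocrd_calamari/ocrd-tool.json

# Print the parameter section for the recognizer, if the file and jq are available.
# The key path below is assumed from the usual ocrd-tool.json layout.
if [ -f "$TOOL_JSON" ] && command -v jq >/dev/null 2>&1; then
    jq '.tools["ocrd-calamari-recognize"].parameters' "$TOOL_JSON"
fi
```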
## Development & Testing
For information regarding development and testing, please see
[README-DEV.md](README-DEV.md).