dinglehopper
dinglehopper is an OCR evaluation tool that reads ALTO, PAGE and text files. It compares a ground truth (GT) document page with an OCR result page to compute metrics and to produce a word/character difference report.
Goals
- Useful
  - As a UI tool
  - For an automated evaluation
  - As a library
- Unicode support
Installation
It's best to use pip, e.g.:
sudo pip install .
Usage
Usage: dinglehopper [OPTIONS] GT OCR [REPORT_PREFIX]
Compare the PAGE/ALTO/text document GT against the document OCR.
dinglehopper detects if GT/OCR are ALTO or PAGE XML documents to extract
their text and falls back to plain text if no ALTO or PAGE is detected.
The files GT and OCR are usually a ground truth document and the result of
an OCR software, but you may use dinglehopper to compare two OCR results. In
that case, use --metrics='' to disable the then meaningless metrics and also
change the color scheme from green/red to blue.
The comparison report will be written to $REPORT_PREFIX.{html,json}, where
$REPORT_PREFIX defaults to "report". Depending on your configuration the
reports include the character error rate (CER), the word error rate (WER)
and the flexible character accuracy (FCA).
The metrics can be chosen via a comma separated combination of their acronyms
like "--metrics=cer,wer,fca".
By default, the text of PAGE files is extracted on 'region' level. You may
use "--textequiv-level line" to extract from the level of TextLine tags.
Options:
--metrics Enable different metrics like cer, wer and fca.
--textequiv-level LEVEL PAGE TextEquiv level to extract text from
--progress Show progress bar
--help Show this message and exit.
For example:
dinglehopper some-document.gt.page.xml some-document.ocr.alto.xml
This generates report.html and report.json.
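To make the CER and WER figures in the report easier to interpret, here is a minimal sketch of how such rates are commonly defined: the edit distance between GT and OCR divided by the number of GT units. This is an illustration only, not dinglehopper's implementation. The function names `levenshtein` and `error_rate` are hypothetical, and the sketch counts plain code points and whitespace-separated words, whereas dinglehopper's own text normalization and tokenization may differ.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(gt_units, ocr_units):
    """Edit distance divided by the number of ground truth units."""
    if not gt_units:
        # Assumption: an empty GT with a non-empty OCR counts as an infinite rate.
        return 0.0 if not ocr_units else float("inf")
    return levenshtein(gt_units, ocr_units) / len(gt_units)


gt = "The quick brown fox"
ocr = "The qu1ck brown fax"

cer = error_rate(list(gt), list(ocr))      # units are code points in this sketch
wer = error_rate(gt.split(), ocr.split())  # units are whitespace-separated words
print(f"CER = {cer:.4f}, WER = {wer:.4f}")
```

Because the rate is normalized by the GT length, it can exceed 1.0 when the OCR result contains many spurious insertions.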
dinglehopper-extract
The tool dinglehopper-extract extracts the text of the given input file on stdout, for example:
dinglehopper-extract --textequiv-level line OCR-D-GT-PAGE/00000024.page.xml
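The fallback described in the usage text (ALTO or PAGE XML if detected, plain text otherwise) could be sketched as follows. This is an assumption about one reasonable way to do it, not dinglehopper's actual detection code. The function name `detect_format` is hypothetical, and the sketch relies only on the root element names `alto` (ALTO) and `PcGts` (PAGE).

```python
import xml.etree.ElementTree as ET


def detect_format(data: bytes) -> str:
    """Return "alto", "page" or "text" for the given file content."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return "text"  # not well-formed XML: treat as plain text
    # ElementTree renders namespaced tags as "{namespace}localname".
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "alto":
        return "alto"
    if local_name == "PcGts":
        return "page"
    return "text"  # some other XML document: fall back to plain text


print(detect_format(b"Just a line of plain text."))
```

Checking the local name rather than the full namespace keeps the sketch independent of the schema version, at the cost of being less strict.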
OCR-D
As an OCR-D processor:
ocrd-dinglehopper -I OCR-D-GT-PAGE,OCR-D-OCR-TESS -O OCR-D-OCR-TESS-EVAL
This generates HTML and JSON reports in the OCR-D-OCR-TESS-EVAL filegroup.
The OCR-D processor has these parameters:
Parameter | Meaning |
---|---|
-P metrics cer,wer | Enable character error rate and word error rate (default) |
-P textequiv_level line | (PAGE) Extract text from TextLine level (default: TextRegion level) |
For example:
ocrd-dinglehopper -I ABBYY-FULLTEXT,OCR-D-OCR-CALAMARI -O OCR-D-OCR-COMPARE-ABBYY-CALAMARI -P metrics cer,wer
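When several OCR filegroups are evaluated against the same ground truth, the command line can be assembled programmatically. The following sketch only builds and prints the commands, using the -I, -O and -P options shown above. The helper name `ocrd_dinglehopper_command` and the `<group>-EVAL` naming of output filegroups are assumptions made for illustration.

```python
import shlex


def ocrd_dinglehopper_command(gt_group, ocr_group, output_group,
                              metrics=("cer", "wer"), textequiv_level=None):
    """Build the argument list for one ocrd-dinglehopper call."""
    command = [
        "ocrd-dinglehopper",
        "-I", f"{gt_group},{ocr_group}",  # GT filegroup first, then OCR filegroup
        "-O", output_group,
        "-P", "metrics", ",".join(metrics),
    ]
    if textequiv_level is not None:
        command += ["-P", "textequiv_level", textequiv_level]
    return command


for ocr_group in ["OCR-D-OCR-TESS", "OCR-D-OCR-CALAMARI"]:
    command = ocrd_dinglehopper_command(
        "OCR-D-GT-PAGE", ocr_group, f"{ocr_group}-EVAL"
    )
    print(shlex.join(command))
```

To run a command instead of printing it, the list could be passed to `subprocess.run(command, check=True)` from within an OCR-D workspace.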
Developer information
Please refer to README-DEV.md.