

dinglehopper is an OCR evaluation tool that reads ALTO, PAGE and text files. It compares a ground truth (GT) document page with an OCR result page to compute metrics and a report of word and character differences. It also supports batch processing by generating, aggregating and summarizing multiple reports.



  • Useful
    • As a UI tool
    • For an automated evaluation
    • As a library
  • Unicode support


It's best to use pip, e.g.:

sudo pip install .



  Compare the PAGE/ALTO/text document GT against the document OCR.

  dinglehopper detects if GT/OCR are ALTO or PAGE XML documents to extract
  their text and falls back to plain text if no ALTO or PAGE is detected.

  The files GT and OCR are usually a ground truth document and the result of
  an OCR software, but you may use dinglehopper to compare two OCR results. In
  that case, use --no-metrics to disable the then meaningless metrics and also
  change the color scheme from green/red to blue.

  The comparison report will be written to
  $REPORTS_FOLDER/$REPORT_PREFIX.{html,json}, where $REPORTS_FOLDER defaults
  to the current working directory and $REPORT_PREFIX defaults to "report".
  The reports include the character error rate (CER) and the word error rate
  (WER).

  By default, the text of PAGE files is extracted on 'region' level. You may
  use "--textequiv-level line" to extract from the level of TextLine tags.

  --metrics / --no-metrics  Enable/disable metrics and green/red color scheme
  --differences BOOLEAN     Enable reporting character and word level
                            differences
  --textequiv-level LEVEL   PAGE TextEquiv level to extract text from
  --progress                Show progress bar
  --help                    Show this message and exit.

For example:

dinglehopper some-document.ocr.alto.xml

This generates report.html and report.json.

Screenshot: dinglehopper displaying metrics and character differences
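To make the two reported metrics concrete, here is a minimal sketch of how a character error rate and a word error rate can be computed from an edit distance. It is an illustration only, not dinglehopper's own implementation: the function names are hypothetical, words are naively split on whitespace, and characters are compared as NFC-normalized code points.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(gt_items, ocr_items):
    """Edit distance divided by the length of the ground truth."""
    if not gt_items:
        return 0.0 if not ocr_items else float("inf")
    return levenshtein(gt_items, ocr_items) / len(gt_items)


def cer(gt, ocr):
    # Assumption: NFC normalization, so composed and decomposed
    # forms of the same character compare as equal.
    gt = unicodedata.normalize("NFC", gt)
    ocr = unicodedata.normalize("NFC", ocr)
    return error_rate(list(gt), list(ocr))


def wer(gt, ocr):
    # Assumption: words are whatever whitespace splitting yields.
    return error_rate(gt.split(), ocr.split())


if __name__ == "__main__":
    print("CER:", cer("Schlyñ", "Schlym"))
    print("WER:", wer("the quick fox", "the quick box"))
```

Both rates are normalized by the length of the ground truth, so a value above 1 is possible when the OCR result contains many spurious insertions.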

Batch comparison between folders of GT and OCR files can be done by simply providing folders:

dinglehopper gt/ ocr/ report output_folder/

This assumes that you have files with the same name in both folders, e.g. gt/ and ocr/00000001.alto.xml.

The example generates reports for each set of files, with the prefix report, in the (automatically created) folder output_folder/.
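The pairing of files in batch mode can be pictured with the sketch below. It assumes, as the text above suggests, that a GT file and an OCR file belong together when they have exactly the same file name; how dinglehopper actually matches files may differ, and `pair_files` is a hypothetical helper, not part of the tool.

```python
import tempfile
from pathlib import Path


def pair_files(gt_dir, ocr_dir):
    """Return (pairs, unmatched) for two folders, matching on file name."""
    gt_dir, ocr_dir = Path(gt_dir), Path(ocr_dir)
    gt_names = {p.name for p in gt_dir.iterdir() if p.is_file()}
    ocr_names = {p.name for p in ocr_dir.iterdir() if p.is_file()}
    # Files present in both folders form a comparison pair.
    pairs = [(gt_dir / n, ocr_dir / n) for n in sorted(gt_names & ocr_names)]
    # Files present in only one folder cannot be compared.
    unmatched = sorted(gt_names ^ ocr_names)
    return pairs, unmatched


if __name__ == "__main__":
    # Small demo with throwaway folders and made-up file names.
    with tempfile.TemporaryDirectory() as root:
        gt, ocr = Path(root, "gt"), Path(root, "ocr")
        gt.mkdir()
        ocr.mkdir()
        for name in ("00000001.xml", "00000002.xml"):
            (gt / name).write_text("ground truth")
        (ocr / "00000001.xml").write_text("ocr result")
        pairs, unmatched = pair_files(gt, ocr)
        print(len(pairs), "pair(s); unmatched:", unmatched)
```

Listing the unmatched names up front is a cheap way to notice missing OCR results before starting a long batch run.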

By default, the JSON report does not contain the character and word differences, only the calculated metrics. If you want to include the differences, use the --differences flag:

dinglehopper gt/ ocr/ report output_folder/ --differences


A set of (JSON) reports can be summarized into a single set of reports. This is useful after having generated reports in batch. Example:

dinglehopper-summarize output_folder/

This generates summary.html and summary.json in the same output_folder.

If you are summarizing many reports and have used the --differences flag while generating them, it may be useful to limit the number of differences reported by using the --occurences-threshold parameter. This will reduce the size of the generated HTML report, making it easier to open and navigate. Note that the JSON report will still contain all differences. Example:

dinglehopper-summarize output_folder/ --occurences-threshold 10
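The effect of such a threshold can be sketched as follows. This is an illustration under stated assumptions, not the summarizer's actual code: the per-report data shape (a mapping from a difference label to its count), the label format, and the choice of "at least the threshold" as the cutoff are all hypothetical.

```python
from collections import Counter


def aggregate_differences(reports):
    """Sum the difference counts of several reports into one Counter."""
    total = Counter()
    for differences in reports:
        total.update(differences)  # adds counts key by key
    return total


def filter_by_threshold(counts, threshold):
    """Keep differences seen at least `threshold` times, most frequent first."""
    return {diff: n for diff, n in counts.most_common() if n >= threshold}


if __name__ == "__main__":
    # Hypothetical per-report difference counts ("GT :: OCR" labels).
    reports = [
        {"ſ :: s": 7, "ü :: u": 1},
        {"ſ :: s": 5, "e :: c": 2},
    ]
    totals = aggregate_differences(reports)
    print(filter_by_threshold(totals, 10))
```

Filtering only at display time matches the behaviour described above, where the HTML summary is reduced while the JSON summary keeps every difference.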


You may also want to compare a directory of GT text files (e.g. gt/) with a directory of OCR text files (e.g. ocr/line0001.some-ocr.txt) using a separate command-line interface:

dinglehopper-line-dirs gt/ ocr/


The tool dinglehopper-extract writes the text of the given input file to stdout, for example:

dinglehopper-extract --textequiv-level line OCR-D-GT-PAGE/


As an OCR-D processor:


This generates HTML and JSON reports in the OCR-D-OCR-TESS-EVAL filegroup.

The OCR-D processor has these parameters:

Parameter                  Meaning
-P metrics false           Disable metrics and the green-red color scheme (default: enabled)
-P textequiv_level line    (PAGE) Extract text from TextLine level (default: TextRegion level)

For example:


Developer information

Please refer to