dinglehopper

dinglehopper is an OCR evaluation tool that reads ALTO, PAGE and text files. It compares a ground truth (GT) document page with an OCR result page to compute metrics and produce a report of word and character differences.

Goals

  • Useful
    • As a UI tool
    • For an automated evaluation
    • As a library
  • Unicode support

Installation

It's best to use pip, e.g.:

sudo pip install .

Usage

Usage: dinglehopper [OPTIONS] GT OCR [REPORT_PREFIX]

  Compare the PAGE/ALTO/text document GT against the document OCR.

  dinglehopper detects if GT/OCR are ALTO or PAGE XML documents to extract
  their text and falls back to plain text if no ALTO or PAGE is detected.

  The files GT and OCR are usually a ground truth document and the result of
  an OCR software, but you may use dinglehopper to compare two OCR results.
  In that case, use --no-metrics to disable the then meaningless metrics and
  also change the color scheme from green/red to blue.

  The comparison report will be written to $REPORT_PREFIX.{html,json}, where
  $REPORT_PREFIX defaults to "report". The reports include the character
  error rate (CER) and the word error rate (WER).

Options:
  --metrics / --no-metrics  Enable/disable metrics and green/red
  --help                    Show this message and exit.
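
The fallback described in the help text above can be pictured as a check on the XML root element. The following is a minimal Python sketch, not dinglehopper's actual code: the function name `detect_format` is hypothetical, and it assumes that ALTO documents have an `alto` root element and PAGE documents a `PcGts` root element.

```python
import xml.etree.ElementTree as ET


def detect_format(data: str) -> str:
    """Return "alto", "page" or "text" for the given document content.

    Hypothetical helper for illustration only.
    """
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # Not well-formed XML: fall back to plain text.
        return "text"

    # Strip a leading "{namespace}" prefix, if any, to get the local name.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "alto":
        return "alto"
    if local_name == "PcGts":
        return "page"
    # XML, but neither ALTO nor PAGE: treat it as plain text as well.
    return "text"
```

Checking the root element's local name rather than a full namespace URI keeps the sketch independent of the specific ALTO or PAGE schema version.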

For example, to compare a PAGE ground truth file with an ALTO OCR result:

dinglehopper some-document.gt.page.xml some-document.ocr.alto.xml

This generates report.html and report.json.
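
To give a rough idea of what the CER and WER in these reports measure, here is a minimal Python sketch based on the Levenshtein (edit) distance. It is an illustration under simplifying assumptions, not dinglehopper's implementation: it applies only NFC normalization, compares Unicode code points, and splits words on whitespace, whereas dinglehopper's own normalization and tokenization may differ. All function names are hypothetical.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (x != y),  # substitution (or match)
                )
            )
        prev = curr
    return prev[-1]


def error_rate(gt_items, ocr_items):
    """Edit distance divided by the length of the ground truth."""
    if not gt_items:
        return 0.0 if not ocr_items else float("inf")
    return levenshtein(gt_items, ocr_items) / len(gt_items)


def cer(gt: str, ocr: str) -> float:
    """Character error rate on NFC-normalized code points (assumption)."""
    gt = unicodedata.normalize("NFC", gt)
    ocr = unicodedata.normalize("NFC", ocr)
    return error_rate(list(gt), list(ocr))


def wer(gt: str, ocr: str) -> float:
    """Word error rate using whitespace tokenization (assumption)."""
    return error_rate(gt.split(), ocr.split())
```

With these definitions, `cer("abcd", "abxd")` is 0.25: one substitution divided by four ground truth characters.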

As an OCR-D processor:

ocrd-dinglehopper -I OCR-D-GT-PAGE,OCR-D-OCR-TESS -O OCR-D-OCR-TESS-EVAL

This generates HTML and JSON reports in the OCR-D-OCR-TESS-EVAL filegroup.

Screenshot: dinglehopper displaying metrics and character differences

You may also want to disable metrics and the green/red color scheme using the metrics parameter:

ocrd-dinglehopper -I ABBYY-FULLTEXT,OCR-D-OCR-CALAMARI -O OCR-D-OCR-COMPARE-ABBYY-CALAMARI -p '{"metrics": false}'

Developer information

Please refer to README-DEV.md.