mirror of https://github.com/qurator-spk/dinglehopper.git synced 2026-01-11 12:57:31 +01:00

Benjamin Rosemann b24d8d5664 Performance increases Temporarily switch to the c-implementation of python-levenshtein for editops calculatation. Also added some variables, caching and type changes for performance gains.		2021-02-16 11:28:24 +01:00
.circleci	🚧 Replace Travis with CircleCI	2021-02-10 17:58:58 +01:00
.screenshots	📝 dinglehopper: Update screenshot to include a region id tooltip	2020-10-08 17:17:34 +02:00
qurator	Performance increases	2021-02-16 11:28:24 +01:00
.coveragerc	Added some helpful tools and configurations	2020-11-11 11:36:17 +01:00
.drone.jsonnet	🚧 dinglehopper: Try out Drone CI	2021-02-11 14:26:29 +01:00
.editorconfig	Preparation for black code formatter	2020-11-11 11:36:17 +01:00
.gitignore	Remove .idea folder and modify .gitignore	2020-11-11 11:36:17 +01:00
LICENSE	Revert "Merge branch 'master' of https://github.com/qurator-spk/sbb_textline_detector "	2019-12-09 12:44:05 +01:00
ocrd-tool.json	Revert "Merge branch 'master' of https://github.com/qurator-spk/sbb_textline_detector "	2019-12-09 12:44:05 +01:00
pytest.ini	Readd pytest.ini	2021-02-16 11:28:23 +01:00
README-DEV.md	📝 dinglehopper: README-DEV: Massage markdown a bit	2020-11-12 19:05:14 +01:00
README.md	Include fca as parameter and add some tests	2021-02-16 11:28:23 +01:00
requirements-dev.txt	Add black to developer requirements.	2020-11-11 11:36:17 +01:00
requirements.txt	Performance increases	2021-02-16 11:28:24 +01:00
setup.cfg	Added some helpful tools and configurations	2020-11-11 11:36:17 +01:00
setup.py	Added some helpful tools and configurations	2020-11-11 11:36:17 +01:00

README.md

dinglehopper

dinglehopper is an OCR evaluation tool that reads ALTO, PAGE and text files. It compares a ground truth (GT) document page with an OCR result page to compute metrics and a report of word and character differences.

Goals

- Useful
  - As a UI tool
  - For an automated evaluation
  - As a library
- Unicode support

Installation

It's best to use pip, e.g.:

sudo pip install .

Usage

Usage: dinglehopper [OPTIONS] GT OCR [REPORT_PREFIX]

  Compare the PAGE/ALTO/text document GT against the document OCR.

  dinglehopper detects if GT/OCR are ALTO or PAGE XML documents to extract
  their text and falls back to plain text if no ALTO or PAGE is detected.

  The files GT and OCR are usually a ground truth document and the result of
  an OCR software, but you may use dinglehopper to compare two OCR results. In
  that case, use --metrics='' to disable the then meaningless metrics and also
  change the color scheme from green/red to blue.

  The comparison report will be written to $REPORT_PREFIX.{html,json}, where
  $REPORT_PREFIX defaults to "report". Depending on your configuration the
  reports include the character error rate (CER), the word error rate (WER)
  and the flexible character accuracy (FCA).

  The metrics can be chosen via a comma separated combination of their acronyms
  like "--metrics=cer,wer,fca".

  By default, the text of PAGE files is extracted on 'region' level. You may
  use "--textequiv-level line" to extract from the level of TextLine tags.

Options:
  --metrics                 Enable different metrics like cer, wer and fca.
  --textequiv-level LEVEL   PAGE TextEquiv level to extract text from
  --progress                Show progress bar
  --help                    Show this message and exit.

For example:

dinglehopper some-document.gt.page.xml some-document.ocr.alto.xml

This generates report.html and report.json.
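
For automated evaluation, the command line above can be driven from a script. The following is a minimal sketch, assuming `dinglehopper` is installed and on the `PATH`. The helper names `build_command` and `run_and_load` are hypothetical and not part of dinglehopper; only the options shown in the usage text are used. The key names inside the JSON report are not documented here, so the sketch returns the parsed report without interpreting it.

```python
import json
import subprocess
from pathlib import Path


def build_command(gt, ocr, report_prefix="report", metrics=None, textequiv_level=None):
    """Assemble a dinglehopper argument list (hypothetical helper)."""
    cmd = ["dinglehopper"]
    if metrics is not None:
        # Comma-separated acronyms as in the usage text, e.g. "cer,wer,fca"
        cmd.append(f"--metrics={metrics}")
    if textequiv_level is not None:
        # For PAGE input, e.g. "line" to extract from TextLine level
        cmd += ["--textequiv-level", textequiv_level]
    cmd += [str(gt), str(ocr), report_prefix]
    return cmd


def run_and_load(gt, ocr, report_prefix="report", **options):
    """Run dinglehopper, then return the parsed JSON report (hypothetical helper)."""
    # check=True raises CalledProcessError if the tool exits with an error
    subprocess.run(build_command(gt, ocr, report_prefix, **options), check=True)
    # The usage text says the report is written to $REPORT_PREFIX.json
    report_path = Path(f"{report_prefix}.json")
    return json.loads(report_path.read_text(encoding="utf-8"))
```

Calling `run_and_load("some-document.gt.page.xml", "some-document.ocr.alto.xml", metrics="cer,wer")` would run the example above and load `report.json` from the current directory.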

dinglehopper-extract

The tool dinglehopper-extract extracts the text of the given input file and writes it to stdout, for example:

dinglehopper-extract --textequiv-level line OCR-D-GT-PAGE/00000024.page.xml

OCR-D

As an OCR-D processor:

ocrd-dinglehopper -I OCR-D-GT-PAGE,OCR-D-OCR-TESS -O OCR-D-OCR-TESS-EVAL

This generates HTML and JSON reports in the OCR-D-OCR-TESS-EVAL filegroup.

The OCR-D processor has these parameters:

| Parameter | Meaning |
| --- | --- |
| `-P metrics cer,wer` | Enable character error rate and word error rate (default) |
| `-P textequiv_level line` | (PAGE) Extract text from TextLine level (default: TextRegion level) |

For example:

ocrd-dinglehopper -I ABBYY-FULLTEXT,OCR-D-OCR-CALAMARI -O OCR-D-OCR-COMPARE-ABBYY-CALAMARI -P metrics cer,wer
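
To make the `cer` and `wer` metrics concrete, here is a minimal sketch of their textbook definitions: the edit distance between GT and OCR, divided by the length of the GT, counted in characters or in words. This is an illustration under simplifying assumptions and not dinglehopper's implementation, which may differ in Unicode handling, word tokenization and normalization. All function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference sequence."""
    if len(reference) == 0:
        # Assumption: an empty reference yields 0 if both are empty, else infinity
        return 0.0 if len(hypothesis) == 0 else float("inf")
    return levenshtein(reference, hypothesis) / len(reference)


def cer(gt, ocr):
    """Character error rate over plain code points (a simplification)."""
    return error_rate(list(gt), list(ocr))


def wer(gt, ocr):
    """Word error rate using whitespace tokenization (a simplification)."""
    return error_rate(gt.split(), ocr.split())
```

For instance, `cer("kitten", "sitting")` is 3 edits over 6 reference characters, and `wer("the quick brown fox", "the quick brwon fox")` is 1 wrong word out of 4.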

Developer information

Please refer to README-DEV.md.