dinglehopper
============

dinglehopper is an OCR evaluation tool that reads
[ALTO](https://github.com/altoxml),
[PAGE](https://github.com/PRImA-Research-Lab/PAGE-XML) and plain text files.  It
compares a ground truth (GT) document page with an OCR result page, computes
metrics and generates a word/character difference report.

[![Build Status](https://travis-ci.org/qurator-spk/dinglehopper.svg?branch=master)](https://travis-ci.org/qurator-spk/dinglehopper)

Goals
-----
* Useful
  * As a UI tool
  * For an automated evaluation
  * As a library
* Unicode support

Installation
------------
It's best to use pip, e.g.:
~~~
sudo pip install .
~~~

Usage
-----
~~~
Usage: dinglehopper [OPTIONS] GT OCR [REPORT_PREFIX]

  Compare the PAGE/ALTO/text document GT against the document OCR.

  dinglehopper detects if GT/OCR are ALTO or PAGE XML documents to extract
  their text and falls back to plain text if no ALTO or PAGE is detected.

  The files GT and OCR are usually a ground truth document and the result of
  an OCR software, but you may use dinglehopper to compare two OCR results.
  In that case, use --no-metrics to disable the then meaningless metrics and
  also change the color scheme from green/red to blue.

  The comparison report will be written to $REPORT_PREFIX.{html,json}, where
  $REPORT_PREFIX defaults to "report". The reports include the character
  error rate (CER) and the word error rate (WER).

  By default, the text of PAGE files is extracted on 'region' level. You may
  use "--textequiv-level line" to extract from the level of TextLine tags.

Options:
  --metrics / --no-metrics  Enable/disable metrics and green/red
  --textequiv-level LEVEL   PAGE TextEquiv level to extract text from
  --progress                Show progress bar
  --help                    Show this message and exit.
~~~

For example:
~~~
dinglehopper some-document.gt.page.xml some-document.ocr.alto.xml
~~~
This generates `report.html` and `report.json`.
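
The usage text above mentions the character error rate (CER) and the word error rate (WER). As a rough illustration of what these metrics mean, here is a minimal sketch that divides an edit distance by the length of the ground truth. It is not dinglehopper's actual implementation: the function names are hypothetical, the NFC normalization is an assumption, and the real tool may treat Unicode and tokenization differently.

~~~python
import unicodedata


def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(gt_items, ocr_items):
    """Edit distance normalized by the number of ground truth items."""
    if not gt_items:
        return 0.0 if not ocr_items else float("inf")
    return levenshtein(gt_items, ocr_items) / len(gt_items)


def cer(gt, ocr):
    """Character error rate (hypothetical helper, compares code points)."""
    # Assumption: normalize to NFC so composed and decomposed forms compare equal.
    gt = unicodedata.normalize("NFC", gt)
    ocr = unicodedata.normalize("NFC", ocr)
    return error_rate(gt, ocr)


def wer(gt, ocr):
    """Word error rate (hypothetical helper, splits on whitespace)."""
    return error_rate(gt.split(), ocr.split())
~~~

With these helpers, `wer("the quick fox", "the quick box")` counts one wrong word out of three ground truth words. Both rates can exceed 1 when the OCR text is much longer than the ground truth.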


As an OCR-D processor:
~~~
ocrd-dinglehopper -I OCR-D-GT-PAGE,OCR-D-OCR-TESS -O OCR-D-OCR-TESS-EVAL
~~~
This generates HTML and JSON reports in the `OCR-D-OCR-TESS-EVAL` filegroup.
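
For an automated evaluation, the JSON reports can be collected with a few lines of Python. The sketch below is only an illustration: `load_reports` is a hypothetical helper, it assumes the filegroup is stored in a directory of the same name, and it makes no assumption about the keys inside each report; it only lists whatever top-level keys it finds.

~~~python
import json
from pathlib import Path


def load_reports(directory):
    """Load every *.json file in `directory`, keyed by file name."""
    reports = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            reports[path.name] = json.load(f)
    return reports


if __name__ == "__main__":
    # Assumption: the filegroup lives in a directory named like the filegroup.
    for name, report in load_reports("OCR-D-OCR-TESS-EVAL").items():
        # Show the top-level keys without assuming a particular schema.
        keys = sorted(report) if isinstance(report, dict) else type(report).__name__
        print(name, keys)
~~~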


![dinglehopper displaying metrics and character differences](.screenshots/dinglehopper.png?raw=true)

To disable the metrics and the green/red color scheme in the OCR-D processor,
set the `metrics` parameter to `false`:

~~~
ocrd-dinglehopper -I ABBYY-FULLTEXT,OCR-D-OCR-CALAMARI -O OCR-D-OCR-COMPARE-ABBYY-CALAMARI -P metrics false
~~~

The tool `dinglehopper-extract` extracts the text of the given input file and
prints it to stdout, for example:

`dinglehopper-extract OCR-D-GT-PAGE/00000024.page.xml`

Developer information
---------------------
*Please refer to [README-DEV.md](README-DEV.md).*