mirror of https://github.com/qurator-spk/dinglehopper.git (synced 2025-06-17 23:59:59 +02:00)
Add batch processing and report summaries
parent dd9303b429
commit 207804e6a6
17 changed files with 17584 additions and 26 deletions
README.md (58 lines changed)
@@ -5,7 +5,8 @@ dinglehopper is an OCR evaluation tool and reads
 [ALTO](https://github.com/altoxml),
 [PAGE](https://github.com/PRImA-Research-Lab/PAGE-XML) and text files. It
 compares a ground truth (GT) document page with a OCR result page to compute
-metrics and a word/character differences report.
+metrics and a word/character differences report. It also supports batch processing by
+generating, aggregating and summarizing multiple reports.
 
 [](https://circleci.com/gh/qurator-spk/dinglehopper)
 
@ -27,7 +28,7 @@ sudo pip install .
|
|||
Usage
|
||||
-----
|
||||
~~~
|
||||
Usage: dinglehopper [OPTIONS] GT OCR [REPORT_PREFIX]
|
||||
Usage: dinglehopper [OPTIONS] GT OCR [REPORT_PREFIX] [REPORTS_FOLDER]
|
||||
|
||||
Compare the PAGE/ALTO/text document GT against the document OCR.
|
||||
|
||||
|
@ -35,19 +36,23 @@ Usage: dinglehopper [OPTIONS] GT OCR [REPORT_PREFIX]
|
|||
their text and falls back to plain text if no ALTO or PAGE is detected.
|
||||
|
||||
The files GT and OCR are usually a ground truth document and the result of
|
||||
an OCR software, but you may use dinglehopper to compare two OCR results.
|
||||
In that case, use --no-metrics to disable the then meaningless metrics and
|
||||
also change the color scheme from green/red to blue.
|
||||
an OCR software, but you may use dinglehopper to compare two OCR results. In
|
||||
that case, use --no-metrics to disable the then meaningless metrics and also
|
||||
change the color scheme from green/red to blue.
|
||||
|
||||
The comparison report will be written to $REPORT_PREFIX.{html,json}, where
|
||||
$REPORT_PREFIX defaults to "report". The reports include the character
|
||||
error rate (CER) and the word error rate (WER).
|
||||
The comparison report will be written to
|
||||
$REPORTS_FOLDER/$REPORT_PREFIX.{html,json}, where $REPORTS_FOLDER defaults
|
||||
to the current working directory and $REPORT_PREFIX defaults to "report".
|
||||
The reports include the character error rate (CER) and the word error rate
|
||||
(WER).
|
||||
|
||||
By default, the text of PAGE files is extracted on 'region' level. You may
|
||||
use "--textequiv-level line" to extract from the level of TextLine tags.
|
||||
|
||||
Options:
|
||||
--metrics / --no-metrics Enable/disable metrics and green/red
|
||||
--differences BOOLEAN Enable reporting character and word level
|
||||
differences
|
||||
--textequiv-level LEVEL PAGE TextEquiv level to extract text from
|
||||
--progress Show progress bar
|
||||
--help Show this message and exit.
|
||||
|
@ -61,6 +66,43 @@ This generates `report.html` and `report.json`.
|
|||
|
||||

|
||||
|
||||
Batch comparison between folders of GT and OCR files can be done by simply providing
|
||||
folders:
|
||||
~~~
|
||||
dinglehopper gt/ ocr/ report output_folder/
|
||||
~~~
|
||||
This assumes that you have files with the same name in both folders, e.g.
|
||||
`gt/00000001.page.xml` and `ocr/00000001.alto.xml`.
|
||||
|
||||
The example generates reports for each set of files, with the prefix `report`, in the
|
||||
(automatically created) folder `output_folder/`.
|
||||
|
||||
By default, the JSON report does not contain the character and word differences, only
|
||||
the calculated metrics. If you want to include the differences, use the
|
||||
`--differences` flag:
|
||||
|
||||
~~~
|
||||
dinglehopper gt/ ocr/ report output_folder/ --differences
|
||||
~~~
|
||||
|
||||
### dinglehopper-summarize
|
||||
A set of (JSON) reports can be summarized into a single set of
|
||||
reports. This is useful after having generated reports in batch.
|
||||
Example:
|
||||
~~~
|
||||
dinglehopper-summarize output_folder/
|
||||
~~~
|
||||
This generates `summary.html` and `summary.json` in the same `output_folder`.
|
||||
|
||||
If you are summarizing many reports and have used the `--differences` flag while
|
||||
generating them, it may be useful to limit the number of differences reported by using
|
||||
the `--occurences-threshold` parameter. This will reduce the size of the generated HTML
|
||||
report, making it easier to open and navigate. Note that the JSON report will still
|
||||
contain all differences. Example:
|
||||
~~~
|
||||
dinglehopper-summarize output_folder/ --occurences-threshold 10
|
||||
~~~
|
||||
|
||||
### dinglehopper-line-dirs
|
||||
You also may want to compare a directory of GT text files (i.e. `gt/line0001.gt.txt`)
|
||||
with a directory of OCR text files (i.e. `ocr/line0001.some-ocr.txt`) with a separate
|
||||
|
|
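
The summarizing step that the README changes describe (many per-page JSON reports folded into one summary) can be illustrated with a minimal Python sketch. It is not dinglehopper's implementation. It assumes that each JSON report carries numeric top-level `cer` and `wer` fields, and the function name `summarize_reports` and the output keys `num_reports`, `cer_avg` and `wer_avg` are hypothetical names chosen for this example.

```python
import json
from pathlib import Path
from statistics import mean


def summarize_reports(folder):
    """Average the "cer" and "wer" values of the JSON reports in `folder`.

    Assumption: every report is a JSON object with numeric top-level
    "cer" and "wer" keys. Files without both keys are skipped.
    """
    cers, wers = [], []
    for path in sorted(Path(folder).glob("*.json")):
        if path.name == "summary.json":
            continue  # do not fold a summary from an earlier run back in
        data = json.loads(path.read_text(encoding="utf-8"))
        if "cer" in data and "wer" in data:
            cers.append(float(data["cer"]))
            wers.append(float(data["wer"]))
    if not cers:
        # Nothing to average, so report an empty summary
        return {"num_reports": 0, "cer_avg": None, "wer_avg": None}
    return {
        "num_reports": len(cers),
        "cer_avg": mean(cers),
        "wer_avg": mean(wers),
    }
```

Calling `summarize_reports("output_folder/")` on a folder of batch reports would return the unweighted mean of both error rates. A real summary might instead weight each page by its number of characters or words, which this sketch deliberately leaves out.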