mirror of https://github.com/qurator-spk/dinglehopper.git synced 2025-06-08 11:20:26 +02:00

Compare commits

34 commits

Author SHA1 Message Date
3443edd6d3
Merge pull request #145 from bertsky/master
update docker
2025-05-13 12:41:50 +02:00
Robert Sachunsky
b1ef3af1a8 docker: use latest core base stage 2025-05-02 00:19:11 +02:00
Robert Sachunsky
d09e3969f8 docker: prepackage ocrd-all-module-dir.json 2025-05-02 00:19:11 +02:00
b5e99d96c9
Merge pull request #144 from qurator-spk/fix/make-test-results-clearer
✔  GitHub Actions: Make reporting results clearer
2025-04-25 11:31:29 +02:00
774790c36f ✔ GitHub Actions: Make reporting results clearer
In the "Actions" tab on GitHub, the workflow run that would post test results to the
_original_ workflow run is named "Test Report". This would lead me to click on it to see
the results, just to be disappointed.

This aims to make the naming of the GitHub workflows/jobs clearer.
2025-04-25 11:20:00 +02:00
addb572922
Merge pull request #143 from qurator-spk/chore/update-pre-commit
⚙  pre-commit: update
2025-04-25 10:14:30 +02:00
1ebb004386 ⚙ pre-commit: update 2025-04-25 10:13:06 +02:00
c3aa48ec3b Merge branch 'master' of https://github.com/qurator-spk/dinglehopper 2025-04-24 17:16:06 +02:00
628594ef98 📦 v0.11.0 2025-04-24 17:14:44 +02:00
d7814db705
Merge pull request #142 from qurator-spk/feat/flex-line-dirs
Feat/flex line dirs
2025-04-24 16:48:22 +02:00
5639f3db7f ✔ Add a tests that checks if plain text files with BOM are read correctly 2025-04-24 16:44:29 +02:00
9fc8937324 ✒ README: Mention dinglehopper-line-dirs --help 2025-04-24 15:13:19 +02:00
14a4bc56d8 🐛 Add --plain-encoding option to dinglehopper-extract 2025-04-22 18:24:35 +02:00
a70260c10e 🐛 Use warning() to fix DeprecationWarning 2025-04-22 13:57:19 +02:00
224aa02163 🚧 Fix help text 2025-04-22 13:57:19 +02:00
9db5b4caf5 🚧 Add OCR-D parameter for plain text encoding 2025-04-22 13:57:19 +02:00
5578ce83a3 🚧 Add option for text encoding to line dir cli 2025-04-22 13:57:19 +02:00
cf59b951a3 🚧 Add option for text encoding to line dir cli 2025-04-22 13:57:19 +02:00
480b3cf864 ✔ Test that CLI produces a complete HTML report 2025-04-22 13:57:19 +02:00
f1a586cff1 ✔ Test line dirs CLI 2025-04-22 13:57:18 +02:00
3b16c14c16 ✔ Properly test line dir finding 2025-04-22 13:57:18 +02:00
322faeb26c 🎨 Sort imports 2025-04-22 13:57:18 +02:00
c37316da09 🐛 cli_line_dirs: Fix word differences section
At the time of generation of the section, the {gt,ocr}_words generators
were drained. Fix by using a list.

Fixes gh-124.
2025-04-22 13:57:18 +02:00
9414a92f9f 🐛 cli_line_dirs: Type-annotate functions 2025-04-22 13:57:18 +02:00
68344e48f8 🎨 Reformat cli_line_dirs 2025-04-22 13:57:18 +02:00
73ee16fe51 🚧 Support 'merged' GT+OCR line directories 2025-04-22 13:57:18 +02:00
6980d7a252 🚧 Use our own removesuffix() as we still support Python 3.8 2025-04-22 13:57:18 +02:00
2bf2529c38 🚧 Port new line dir functions 2025-04-22 13:57:17 +02:00
ad8e6de36b 🐛 cli_line_dirs: Fix character diff reports 2025-04-22 13:57:17 +02:00
4024e350f7 🚧 Test new flexible line dirs functions 2025-04-22 13:57:17 +02:00
3c317cbeaf
Merge pull request #141 from qurator-spk/chore/update-pre-commit
⚙  pre-commit: update
2025-04-22 12:35:14 +02:00
d8403421fc ⚙ pre-commit: update 2025-04-22 12:30:47 +02:00
3305043234
Merge pull request #140 from qurator-spk/fix/vendor-strings
🐛 Fix vendor strings
2025-04-22 11:50:29 +02:00
6bf5bd7178 🐛 Fix vendor strings 2025-04-22 11:48:44 +02:00
32 changed files with 381 additions and 44 deletions

View file

@ -1,4 +1,4 @@
-name: Test
+name: 'Test'
 on:

View file

@ -1,4 +1,4 @@
-name: 'Test Report'
+name: 'Test - Report results'
 on:
   workflow_run:
     workflows: ['test']
@ -15,6 +15,6 @@ jobs:
 - uses: dorny/test-reporter@v1
   with:
     artifact: /test-results-(.*)/
-    name: 'Tests Results - $1'
+    name: 'test - Results ($1)'
     path: '*junit.xml'
     reporter: java-junit

.gitignore vendored
View file

@ -25,6 +25,7 @@ dmypy.json
 # User-specific stuff
 .idea
+.*.swp
 # Build artifacts
 /build

View file

@ -16,7 +16,7 @@ repos:
   - id: black
 - repo: https://github.com/astral-sh/ruff-pre-commit
-  rev: v0.11.5
+  rev: v0.11.7
   hooks:
   - args:
     - --fix

View file

@ -7,7 +7,7 @@ LABEL \
     org.label-schema.vcs-ref=$VCS_REF \
     org.label-schema.vcs-url="https://github.com/qurator-spk/dinglehopper" \
     org.label-schema.build-date=$BUILD_DATE \
-    org.opencontainers.image.vendor="qurator" \
+    org.opencontainers.image.vendor="Staatsbibliothek zu BerlinSPK" \
     org.opencontainers.image.title="dinglehopper" \
     org.opencontainers.image.description="An OCR evaluation tool" \
     org.opencontainers.image.source="https://github.com/qurator-spk/dinglehopper" \
@ -32,6 +32,8 @@ COPY . .
 COPY ocrd-tool.json .
 # prepackage ocrd-tool.json as ocrd-all-tool.json
 RUN ocrd ocrd-tool ocrd-tool.json dump-tools > $(dirname $(ocrd bashlib filename))/ocrd-all-tool.json
+# prepackage ocrd-all-module-dir.json
+RUN ocrd ocrd-tool ocrd-tool.json dump-module-dirs > $(dirname $(ocrd bashlib filename))/ocrd-all-module-dir.json
 RUN make install && rm -rf /build/dinglehopper
 WORKDIR /data

View file

@ -186,7 +186,7 @@
       same "printed page" as the copyright notice for easier
       identification within third-party archives.
-   Copyright 2019 qurator
+   Copyright 2019-2025 Staatsbibliothek zu BerlinSPK
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.

View file

@ -3,8 +3,9 @@ PIP = pip3
 PYTHONIOENCODING=utf8
 PYTEST_ARGS = -vv
-DOCKER_BASE_IMAGE = docker.io/ocrd/core:v3.3.0
-DOCKER_TAG = ocrd/dinglehopper
+DOCKER_BASE_IMAGE ?= docker.io/ocrd/core:latest
+DOCKER_TAG ?= ocrd/dinglehopper
+DOCKER ?= docker
 help:
 	@echo
@ -24,7 +25,7 @@ test:
 	pytest $(PYTEST_ARGS)
 docker:
-	docker build \
+	$(DOCKER) build \
 	--build-arg DOCKER_BASE_IMAGE=$(DOCKER_BASE_IMAGE) \
 	--build-arg VCS_REF=$$(git rev-parse --short HEAD) \
 	--build-arg BUILD_DATE=$$(date -u +"%Y-%m-%dT%H:%M:%SZ") \

View file

@ -112,9 +112,13 @@ You also may want to compare a directory of GT text files (i.e. `gt/line0001.gt.
 with a directory of OCR text files (i.e. `ocr/line0001.some-ocr.txt`) with a separate
 CLI interface:
-~~~
+```
 dinglehopper-line-dirs gt/ ocr/
-~~~
+```
+The CLI `dinglehopper-line-dirs` can also work with GT text files in the same
+directories as the OCR text files. You should read `dinglehopper-line-dirs --help`
+in this case.
 ### dinglehopper-extract
 The tool `dinglehopper-extract` extracts the text of the given input file on

View file

@ -114,6 +114,7 @@ def process(
     metrics: bool = True,
     differences: bool = False,
     textequiv_level: str = "region",
+    plain_encoding: str = "autodetect",
 ) -> None:
     """Check OCR result against GT.
@ -121,8 +122,12 @@ def process(
     this undecorated version and use Click on a wrapper.
     """
-    gt_text = extract(gt, textequiv_level=textequiv_level)
-    ocr_text = extract(ocr, textequiv_level=textequiv_level)
+    gt_text = extract(
+        gt, textequiv_level=textequiv_level, plain_encoding=plain_encoding
+    )
+    ocr_text = extract(
+        ocr, textequiv_level=textequiv_level, plain_encoding=plain_encoding
+    )
     gt_words: List[str] = list(words_normalized(gt_text))
     ocr_words: List[str] = list(words_normalized(ocr_text))
@ -195,6 +200,7 @@ def process_dir(
     metrics: bool = True,
     differences: bool = False,
     textequiv_level: str = "region",
+    plain_encoding: str = "autodetect",
 ) -> None:
     for gt_file in os.listdir(gt):
         gt_file_path = os.path.join(gt, gt_file)
@ -209,6 +215,7 @@ def process_dir(
                 metrics=metrics,
                 differences=differences,
                 textequiv_level=textequiv_level,
+                plain_encoding=plain_encoding,
             )
         else:
             print("Skipping {0} and {1}".format(gt_file_path, ocr_file_path))
@ -233,6 +240,11 @@ def process_dir(
     help="PAGE TextEquiv level to extract text from",
     metavar="LEVEL",
 )
+@click.option(
+    "--plain-encoding",
+    default="autodetect",
+    help='Encoding (e.g. "utf-8") of plain text files',
+)
 @click.option("--progress", default=False, is_flag=True, help="Show progress bar")
 @click.version_option()
 def main(
@ -243,6 +255,7 @@ def main(
     metrics,
     differences,
     textequiv_level,
+    plain_encoding,
     progress,
 ):
     """
@ -280,6 +293,7 @@ def main(
                 metrics=metrics,
                 differences=differences,
                 textequiv_level=textequiv_level,
+                plain_encoding=plain_encoding,
             )
     else:
         process(
@ -290,6 +304,7 @@ def main(
             metrics=metrics,
             differences=differences,
             textequiv_level=textequiv_level,
+            plain_encoding=plain_encoding,
         )

View file

@ -12,7 +12,12 @@ from .ocr_files import extract
     help="PAGE TextEquiv level to extract text from",
     metavar="LEVEL",
 )
-def main(input_file, textequiv_level):
+@click.option(
+    "--plain-encoding",
+    default="autodetect",
+    help='Encoding (e.g. "utf-8") of plain text files',
+)
+def main(input_file, textequiv_level, plain_encoding):
     """
     Extract the text of the given INPUT_FILE.
@ -23,7 +28,9 @@ def main(input_file, textequiv_level):
     use "--textequiv-level line" to extract from the level of TextLine tags.
     """
     initLogging()
-    input_text = extract(input_file, textequiv_level=textequiv_level).text
+    input_text = extract(
+        input_file, textequiv_level=textequiv_level, plain_encoding=plain_encoding
+    ).text
     print(input_text)

View file

@ -1,5 +1,6 @@
 import itertools
 import os
+from typing import Callable, Iterator, List, Optional, Tuple
 import click
 from jinja2 import Environment, FileSystemLoader
@ -12,6 +13,41 @@ from .ocr_files import plain_extract
 from .word_error_rate import word_error_rate_n, words_normalized
+def removesuffix(text, suffix):
+    """
+    Remove suffix from text.
+    Can be replaced with str.removesuffix when we only support Python >= 3.9.
+    """
+    if suffix and text.endswith(suffix):
+        return text[: -len(suffix)]
+    return text
+def is_hidden(filepath):
+    filename = os.path.basename(os.path.abspath(filepath))
+    return filename.startswith(".")
+def find_all_files(
+    dir_: str, pred: Optional[Callable[[str], bool]] = None, return_hidden: bool = False
+) -> Iterator[str]:
+    """
+    Find all files in dir_, returning filenames
+    If pred is given, pred(filename) must be True for the filename.
+    Does not return hidden files by default.
+    """
+    for root, _, filenames in os.walk(dir_):
+        for fn in filenames:
+            if not return_hidden and is_hidden(fn):
+                continue
+            if pred and not pred(fn):
+                continue
+            yield os.path.join(root, fn)
 def all_equal(iterable):
     g = itertools.groupby(iterable)
     return next(g, True) and not next(g, False)
@ -25,15 +61,63 @@ def common_suffix(its):
     return reversed(common_prefix(reversed(it) for it in its))
-def removesuffix(text, suffix):
-    if suffix and text.endswith(suffix):
-        return text[: -len(suffix)]
-    return text
+def find_gt_and_ocr_files(
+    gt_dir: str, gt_suffix: str, ocr_dir: str, ocr_suffix: str
+) -> Iterator[Tuple[str, str]]:
+    """
+    Find GT files and matching OCR files.
+    Returns pairs of GT and OCR files.
+    """
+    for gt_fn in find_all_files(gt_dir, lambda fn: fn.endswith(gt_suffix)):
+        ocr_fn = os.path.join(
+            ocr_dir,
+            removesuffix(os.path.relpath(gt_fn, start=gt_dir), gt_suffix) + ocr_suffix,
+        )
+        if not os.path.exists(ocr_fn):
+            raise RuntimeError(f"{ocr_fn} (matching {gt_fn}) does not exist")
+        yield gt_fn, ocr_fn
-def process(gt_dir, ocr_dir, report_prefix, *, metrics=True):
-    gt_suffix = "".join(common_suffix(os.listdir(gt_dir)))
-    ocr_suffix = "".join(common_suffix(os.listdir(ocr_dir)))
+def find_gt_and_ocr_files_autodetect(gt_dir, ocr_dir):
+    """
+    Find GT files and matching OCR files, autodetect suffixes.
+    This only works if gt_dir (or respectively ocr_dir) only contains GT (OCR)
+    files with a common suffix. Currently the files must have a suffix, e.g.
+    ".gt.txt" (e.g. ".ocr.txt").
+    Returns pairs of GT and OCR files.
+    """
+    # Autodetect suffixes
+    gt_files = find_all_files(gt_dir)
+    gt_suffix = "".join(common_suffix(gt_files))
+    if len(gt_suffix) == 0:
+        raise RuntimeError(
+            f"Files in GT directory {gt_dir} do not have a common suffix"
+        )
+    ocr_files = find_all_files(ocr_dir)
+    ocr_suffix = "".join(common_suffix(ocr_files))
+    if len(ocr_suffix) == 0:
+        raise RuntimeError(
+            f"Files in OCR directory {ocr_dir} do not have a common suffix"
+        )
+    yield from find_gt_and_ocr_files(gt_dir, gt_suffix, ocr_dir, ocr_suffix)
+def process(
+    gt_dir,
+    ocr_dir,
+    report_prefix,
+    *,
+    metrics=True,
+    gt_suffix=None,
+    ocr_suffix=None,
+    plain_encoding="autodetect",
+):
     cer = None
     n_characters = None
@ -42,16 +126,20 @@ def process(gt_dir, ocr_dir, report_prefix, *, metrics=True):
     n_words = None
     word_diff_report = ""
-    for k, gt in enumerate(os.listdir(gt_dir)):
-        # Find a match by replacing the suffix
-        ocr = removesuffix(gt, gt_suffix) + ocr_suffix
+    if gt_suffix is not None and ocr_suffix is not None:
+        gt_ocr_files = find_gt_and_ocr_files(gt_dir, gt_suffix, ocr_dir, ocr_suffix)
+    else:
+        gt_ocr_files = find_gt_and_ocr_files_autodetect(gt_dir, ocr_dir)
-        gt_text = plain_extract(os.path.join(gt_dir, gt), include_filename_in_id=True)
-        ocr_text = plain_extract(
-            os.path.join(ocr_dir, ocr), include_filename_in_id=True
+    for k, (gt_fn, ocr_fn) in enumerate(gt_ocr_files):
+        gt_text = plain_extract(
+            gt_fn, include_filename_in_id=True, encoding=plain_encoding
         )
-        gt_words = words_normalized(gt_text)
-        ocr_words = words_normalized(ocr_text)
+        ocr_text = plain_extract(
+            ocr_fn, include_filename_in_id=True, encoding=plain_encoding
+        )
+        gt_words: List[str] = list(words_normalized(gt_text))
+        ocr_words: List[str] = list(words_normalized(ocr_text))
         # Compute CER
         l_cer, l_n_characters = character_error_rate_n(gt_text, ocr_text)
@ -81,7 +169,7 @@ def process(gt_dir, ocr_dir, report_prefix, *, metrics=True):
             joiner="",
             none="·",
             score_hint=score_hint(l_cer, l_n_characters),
-        )
+        )[0]
         word_diff_report += gen_diff_report(
             gt_words,
             ocr_words,
@ -89,7 +177,7 @@ def process(gt_dir, ocr_dir, report_prefix, *, metrics=True):
             joiner=" ",
             none="",
             score_hint=score_hint(l_wer, l_n_words),
-        )
+        )[0]
     env = Environment(
         loader=FileSystemLoader(
@ -123,17 +211,30 @@ def process(gt_dir, ocr_dir, report_prefix, *, metrics=True):
 @click.option(
     "--metrics/--no-metrics", default=True, help="Enable/disable metrics and green/red"
 )
-def main(gt, ocr, report_prefix, metrics):
+@click.option("--gt-suffix", help="Suffix of GT line text files")
+@click.option("--ocr-suffix", help="Suffix of OCR line text files")
+@click.option(
+    "--plain-encoding",
+    default="autodetect",
+    help='Encoding (e.g. "utf-8") of plain text files',
+)
+def main(gt, ocr, report_prefix, metrics, gt_suffix, ocr_suffix, plain_encoding):
     """
     Compare the GT line text directory against the OCR line text directory.
     This assumes that the GT line text directory contains textfiles with a common
     suffix like ".gt.txt", and the OCR line text directory contains textfiles with
     a common suffix like ".some-ocr.txt". The text files also need to be paired,
-    i.e. the GT file "line001.gt.txt" needs to match a file "line001.some-ocr.txt"
-    in the OCT lines directory.
+    i.e. the GT filename "line001.gt.txt" needs to match a filename
+    "line001.some-ocr.txt" in the OCR lines directory.
-    The GT and OCR directories are usually round truth line texts and the results of
+    GT and OCR directories may contain line text files in matching subdirectories,
+    e.g. "GT/goethe_faust/line1.gt.txt" and "OCR/goethe_faust/line1.pred.txt".
+    GT and OCR directories can also be the same directory, but in this case you need
+    to give --gt-suffix and --ocr-suffix explicitly.
+    The GT and OCR directories are usually ground truth line texts and the results of
     an OCR software, but you may use dinglehopper to compare two OCR results. In
     that case, use --no-metrics to disable the then meaningless metrics and also
     change the color scheme from green/red to blue.
@ -142,9 +243,19 @@ def main(gt, ocr, report_prefix, metrics):
     $REPORT_PREFIX defaults to "report". The reports include the character error
     rate (CER) and the word error rate (WER).
+    It is recommended to specify the encoding of the text files, for example with
+    --plain-encoding utf-8. If this option is not given, we try to auto-detect it.
     """
     initLogging()
-    process(gt, ocr, report_prefix, metrics=metrics)
+    process(
+        gt,
+        ocr,
+        report_prefix,
+        metrics=metrics,
+        gt_suffix=gt_suffix,
+        ocr_suffix=ocr_suffix,
+        plain_encoding=plain_encoding,
+    )
 if __name__ == "__main__":
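
The pairing rule introduced in this file (take each GT path relative to the GT directory, strip the GT suffix, append the OCR suffix, and look the result up under the OCR directory) can be tried in isolation. The following is a minimal sketch and not the project's code: `pair_files` and `demo` are hypothetical names, the sketch uses `str.removesuffix` and therefore assumes Python 3.9 or later (unlike the diff, which keeps its own helper for Python 3.8), and it builds a throwaway directory tree to run against.

```python
import os
import tempfile
from typing import Iterator, List, Tuple


def pair_files(
    gt_dir: str, gt_suffix: str, ocr_dir: str, ocr_suffix: str
) -> Iterator[Tuple[str, str]]:
    """Yield (gt_path, ocr_path) pairs; hypothetical stand-in for illustration."""
    for root, _, filenames in os.walk(gt_dir):
        for fn in sorted(filenames):
            if fn.startswith(".") or not fn.endswith(gt_suffix):
                continue  # skip hidden files and anything that is not a GT file
            gt_path = os.path.join(root, fn)
            rel = os.path.relpath(gt_path, start=gt_dir)
            ocr_path = os.path.join(ocr_dir, rel.removesuffix(gt_suffix) + ocr_suffix)
            if not os.path.exists(ocr_path):
                raise RuntimeError(f"{ocr_path} (matching {gt_path}) does not exist")
            yield gt_path, ocr_path


def demo() -> List[Tuple[str, str]]:
    """Build a 'merged' tree (GT and OCR side by side in a subdirectory) and pair it."""
    with tempfile.TemporaryDirectory() as d:
        sub = os.path.join(d, "goethe_faust")
        os.makedirs(sub)
        for name in ("line1.gt.txt", "line1.some-ocr.txt", ".hidden.gt.txt"):
            with open(os.path.join(sub, name), "w", encoding="utf-8") as f:
                f.write("text\n")
        # Consume the generator while the temporary directory still exists.
        pairs = pair_files(d, ".gt.txt", d, ".some-ocr.txt")
        return [(os.path.relpath(g, d), os.path.relpath(o, d)) for g, o in pairs]


print(demo())
```

Because both suffixes are passed explicitly, the same directory can serve as GT and OCR root, which is the "merged" case the help text describes.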

View file

@ -5,10 +5,13 @@ from typing import Dict, Iterator, Optional
 import chardet
 from lxml import etree as ET
 from lxml.etree import XMLSyntaxError
+from ocrd_utils import getLogger
 from uniseg.graphemecluster import grapheme_clusters
 from .extracted_text import ExtractedText, normalize_sbb
+log = getLogger("processor.OcrdDinglehopperEvaluate")
 def alto_namespace(tree: ET._ElementTree) -> Optional[str]:
     """Return the ALTO namespace used in the given ElementTree.
@ -149,7 +152,7 @@ def detect_encoding(filename):
     return chardet.detect(open(filename, "rb").read(1024))["encoding"]
-def plain_extract(filename, include_filename_in_id=False):
+def plain_extract(filename, include_filename_in_id=False, encoding="autodetect"):
     id_template = "{filename} - line {no}" if include_filename_in_id else "line {no}"
     def make_segment(no, line):
@ -163,7 +166,14 @@ def plain_extract(filename, include_filename_in_id=False):
             clusters,
         )
-    fileencoding = detect_encoding(filename)
+    if encoding == "autodetect":
+        fileencoding = detect_encoding(filename)
+        log.warning(
+            f"Autodetected encoding as '{fileencoding}'"
+            ", it is recommended to specify it explicitly with --plain-encoding"
+        )
+    else:
+        fileencoding = encoding
     with open(filename, "r", encoding=fileencoding) as f:
         return ExtractedText(
             None,
@ -175,11 +185,11 @@ def plain_extract(filename, include_filename_in_id=False):
     # XXX hardcoded SBB normalization
-def plain_text(filename):
-    return plain_extract(filename).text
+def plain_text(filename, encoding="autodetect"):
+    return plain_extract(filename, encoding=encoding).text
-def extract(filename, *, textequiv_level="region"):
+def extract(filename, *, textequiv_level="region", plain_encoding="autodetect"):
     """Extract the text from the given file.
     Supports PAGE, ALTO and falls back to plain text.
@ -187,7 +197,7 @@ def extract(filename, *, textequiv_level="region"):
     try:
         tree = ET.parse(filename)
     except (XMLSyntaxError, UnicodeDecodeError):
-        return plain_extract(filename)
+        return plain_extract(filename, encoding=plain_encoding)
     try:
         return page_extract(tree, textequiv_level=textequiv_level)
     except ValueError:
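
The control flow added to `plain_extract` (use the caller's encoding if one is given, otherwise fall back to detection) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's implementation: `sniff_encoding` is a hypothetical and deliberately naive stand-in for the `chardet`-based `detect_encoding`, and `read_plain` is a hypothetical name.

```python
import codecs
import os
import tempfile
from typing import Tuple


def sniff_encoding(filename: str) -> str:
    """Hypothetical stand-in for real detection: only checks for a UTF-8 BOM."""
    with open(filename, "rb") as f:
        head = f.read(4)
    return "utf-8-sig" if head.startswith(codecs.BOM_UTF8) else "utf-8"


def read_plain(filename: str, encoding: str = "autodetect") -> str:
    """Read a text file, guessing the encoding only when none was specified."""
    file_encoding = sniff_encoding(filename) if encoding == "autodetect" else encoding
    with open(filename, "r", encoding=file_encoding) as f:
        return f.read()


def demo() -> Tuple[str, bool]:
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ocr.txt")
        with open(path, "w", encoding="latin-1") as f:
            f.write("AnÖther test.\n")
        explicit = read_plain(path, encoding="latin-1")
        try:
            read_plain(path)  # the naive guess is UTF-8, which is wrong for this file
            guess_worked = True
        except UnicodeDecodeError:
            guess_worked = False
        return explicit, guess_worked


explicit, guess_worked = demo()
print(repr(explicit), guess_worked)
```

A real detector guesses better than this stand-in, but it still guesses, which is presumably why the diff logs a warning and recommends passing `--plain-encoding` explicitly.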

View file

@ -1,5 +1,5 @@
 {
-  "version": "0.10.1",
+  "version": "0.11.0",
   "git_url": "https://github.com/qurator-spk/dinglehopper",
   "dockerhub": "ocrd/dinglehopper",
   "tools": {
@ -25,6 +25,11 @@
           "enum": ["region", "line"],
           "default": "region",
           "description": "PAGE XML hierarchy level to extract the text from"
+        },
+        "plain_encoding": {
+          "type": "string",
+          "default": "autodetect",
+          "description": "Encoding (e.g. \"utf-8\") of plain text files"
         }
       }
     }

View file

@ -26,6 +26,7 @@ class OcrdDinglehopperEvaluate(Processor):
         assert self.parameter
         metrics = self.parameter["metrics"]
         textequiv_level = self.parameter["textequiv_level"]
+        plain_encoding = self.parameter["plain_encoding"]
         # wrong number of inputs: let fail
         gt_file, ocr_file = input_files
@ -52,6 +53,7 @@ class OcrdDinglehopperEvaluate(Processor):
             self.output_file_grp,
             metrics=metrics,
             textequiv_level=textequiv_level,
+            plain_encoding=plain_encoding,
         )
         # Add reports to the workspace

View file

@ -0,0 +1 @@
This is a test.

View file

@ -0,0 +1 @@
Another test.

View file

@ -0,0 +1 @@
Tis is a test.

View file

@ -0,0 +1 @@
AnÖther test.

View file

@ -0,0 +1 @@
This is a test.

View file

@ -0,0 +1 @@
Tis is a test.

View file

@ -0,0 +1 @@
Another test.

View file

@ -0,0 +1 @@
AnÖther test.

View file

@ -0,0 +1 @@
This is a test.

View file

@ -0,0 +1 @@
Another test.

View file

@ -0,0 +1 @@
Tis is a test.

View file

@ -0,0 +1 @@
AnÖther test.

View file

@ -0,0 +1,61 @@
+import json
+import os.path
+import re
+import pytest
+from ..cli_line_dirs import process
+from .util import working_directory
+data_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "data")
+@pytest.mark.integration
+def test_cli_line_dirs_basic(tmp_path):
+    """Test that the cli/process() produces a good report"""
+    with working_directory(tmp_path):
+        gt_dir = os.path.join(data_dir, "line_dirs/basic/gt")
+        ocr_dir = os.path.join(data_dir, "line_dirs/basic/ocr")
+        process(gt_dir, ocr_dir, "report")
+        with open("report.json", "r") as jsonf:
+            print(jsonf.read())
+        with open("report.json", "r") as jsonf:
+            j = json.load(jsonf)
+            assert j["cer"] == pytest.approx(0.1071429)
+            assert j["wer"] == pytest.approx(0.5)
+@pytest.mark.integration
+def test_cli_line_dirs_basic_report_diff(tmp_path):
+    """Test that the cli/process() produces a report with char+word diff"""
+    with working_directory(tmp_path):
+        gt_dir = os.path.join(data_dir, "line_dirs/basic/gt")
+        ocr_dir = os.path.join(data_dir, "line_dirs/basic/ocr")
+        process(gt_dir, ocr_dir, "report")
+        with open("report.html", "r") as htmlf:
+            html_report = htmlf.read()
+        # Counting GT lines in the diff
+        assert len(re.findall(r"gt.*l\d+-cdiff", html_report)) == 2
+        assert len(re.findall(r"gt.*l\d+-wdiff", html_report)) == 2
+@pytest.mark.integration
+def test_cli_line_dirs_merged(tmp_path):
+    """Test that the cli/process() produces a good report"""
+    with working_directory(tmp_path):
+        gt_dir = os.path.join(data_dir, "line_dirs/merged")
+        ocr_dir = os.path.join(data_dir, "line_dirs/merged")
+        process(
+            gt_dir, ocr_dir, "report", gt_suffix=".gt.txt", ocr_suffix=".some-ocr.txt"
+        )
+        with open("report.json", "r") as jsonf:
+            print(jsonf.read())
+        with open("report.json", "r") as jsonf:
+            j = json.load(jsonf)
+            assert j["cer"] == pytest.approx(0.1071429)
+            assert j["wer"] == pytest.approx(0.5)
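
The word-diff count asserted in `test_cli_line_dirs_basic_report_diff` presumably guards the fix from commit c37316da09, whose message says the `{gt,ocr}_words` generators were already drained when the word differences section was generated. The effect is easy to reproduce outside the project. The sketch below uses hypothetical toy functions (`words`, `count_then_report`), not dinglehopper's `words_normalized`.

```python
from typing import Iterable, Iterator, Tuple


def words(text: str) -> Iterator[str]:
    """Toy word splitter that returns a generator, i.e. a one-shot iterator."""
    return (w for w in text.split())


def count_then_report(ws: Iterable[str]) -> Tuple[int, str]:
    """Use the words twice: once to count them, once to build a report line."""
    n = sum(1 for _ in ws)  # first pass consumes a generator completely
    report = " ".join(ws)   # second pass sees nothing if ws was a generator
    return n, report


drained = count_then_report(words("This is a test."))
kept = count_then_report(list(words("This is a test.")))
print(drained)  # (4, '')
print(kept)     # (4, 'This is a test.')
```

Materializing the words with `list(...)`, as the diff does for `gt_words` and `ocr_words`, makes the second pass see the same data as the first.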

View file

@ -1,4 +1,5 @@
 import json
+import re
 import pytest
@ -40,3 +41,25 @@ def test_cli_json_cer_is_infinity(tmp_path):
         with open("report.json", "r") as jsonf:
             j = json.load(jsonf)
             assert j["cer"] == pytest.approx(float("inf"))
+@pytest.mark.integration
+def test_cli_html(tmp_path):
+    """Test that the cli/process() yields complete HTML report"""
+    with working_directory(tmp_path):
+        with open("gt.txt", "w") as gtf:
+            gtf.write("AAAAA")
+        with open("ocr.txt", "w") as ocrf:
+            ocrf.write("AAAAB")
+        process("gt.txt", "ocr.txt", "report")
+        with open("report.html", "r") as htmlf:
+            html_report = htmlf.read()
+            print(html_report)
+        assert re.search(r"CER: 0\.\d+", html_report)
+        assert re.search(r"WER: 1\.0", html_report)
+        assert len(re.findall("gt.*cdiff", html_report)) == 1
+        assert len(re.findall("gt.*wdiff", html_report)) == 1
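
The values this test matches (`CER: 0.…` and `WER: 1.0` for "AAAAA" against "AAAAB") follow from the usual definition of an error rate as edit distance divided by reference length. The sketch below is a simplified illustration: it compares plain Python characters and whitespace-separated words, whereas dinglehopper works on grapheme clusters and normalized words, and the handling of an empty reference is an assumption modelled on the `float("inf")` test shown above.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance with unit costs, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def error_rate(gt: Sequence, ocr: Sequence) -> float:
    """Errors per reference item; infinite if the reference is empty but OCR is not."""
    d = levenshtein(gt, ocr)
    if len(gt) == 0:
        return float("inf") if d else 0.0
    return d / len(gt)


cer = error_rate("AAAAA", "AAAAB")                  # 1 substitution in 5 characters
wer = error_rate("AAAAA".split(), "AAAAB".split())  # 1 wrong word out of 1
print(cer, wer)  # 0.2 1.0
```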

View file

@ -0,0 +1,71 @@
+import os
+from ..cli_line_dirs import find_gt_and_ocr_files, find_gt_and_ocr_files_autodetect
+data_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "data")
+def test_basic():
+    """Test the dumb method: User gives directories and suffixes."""
+    pairs = list(
+        find_gt_and_ocr_files(
+            os.path.join(data_dir, "line_dirs/basic/gt"),
+            ".gt.txt",
+            os.path.join(data_dir, "line_dirs/basic/ocr"),
+            ".some-ocr.txt",
+        )
+    )
+    assert len(pairs) == 2
+def test_basic_autodetect():
+    """Test autodetect: User gives directories, suffixes are autodetected if possible"""
+    pairs = list(
+        find_gt_and_ocr_files_autodetect(
+            os.path.join(data_dir, "line_dirs/basic/gt"),
+            os.path.join(data_dir, "line_dirs/basic/ocr"),
+        )
+    )
+    assert len(pairs) == 2
+def test_subdirs():
+    """Test the dumb method: Should also work when subdirectories are involved."""
+    pairs = list(
+        find_gt_and_ocr_files(
+            os.path.join(data_dir, "line_dirs/subdirs/gt"),
+            ".gt.txt",
+            os.path.join(data_dir, "line_dirs/subdirs/ocr"),
+            ".some-ocr.txt",
+        )
+    )
+    assert len(pairs) == 2
+def test_subdirs_autodetect():
+    """Test the autodetect method: Should also work when subdirectories are involved."""
+    pairs = list(
+        find_gt_and_ocr_files_autodetect(
+            os.path.join(data_dir, "line_dirs/subdirs/gt"),
+            os.path.join(data_dir, "line_dirs/subdirs/ocr"),
+        )
+    )
+    assert len(pairs) == 2
+def test_merged():
+    """Test the dumb method: GT and OCR texts are in the same directories."""
+    pairs = list(
+        find_gt_and_ocr_files(
+            os.path.join(data_dir, "line_dirs/merged"),
+            ".gt.txt",
+            os.path.join(data_dir, "line_dirs/merged"),
+            ".some-ocr.txt",
+        )
+    )
+    assert len(pairs) == 2
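
There is no autodetect variant of `test_merged`, and a small experiment suggests why: suffix autodetection takes the longest ending shared by all file names in a directory, which stops being useful once GT and OCR files are mixed. The sketch below is an assumption-laden illustration; it swaps in `os.path.commonprefix` on reversed strings instead of the project's `common_suffix`/`common_prefix` helpers, and the file names are made up.

```python
import os
from typing import Iterable


def common_suffix(names: Iterable[str]) -> str:
    """Longest string that every name ends with (empty if there is none)."""
    reversed_names = [name[::-1] for name in names]
    # os.path.commonprefix compares character by character, not by path component.
    return os.path.commonprefix(reversed_names)[::-1]


separate = common_suffix(["a/line1.gt.txt", "b/line2.gt.txt"])
merged = common_suffix(["line1.gt.txt", "line1.some-ocr.txt"])
print(separate)  # .gt.txt
print(merged)    # .txt
```

With GT and OCR files in one directory the detected suffix collapses to `.txt` for both sides, so the two cannot be told apart; this matches the help text's requirement to pass `--gt-suffix` and `--ocr-suffix` in that case.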

View file

@ -182,3 +182,15 @@ def test_plain(tmp_path):
         result = plain_text("ocr.txt")
         expected = "First, a line.\nAnd a second line."
         assert result == expected
+def test_plain_BOM(tmp_path):
+    """Test that plain text files with BOM are read correctly."""
+    BOM = "\ufeff"
+    with working_directory(tmp_path):
+        with open("ocr.txt", "w") as ocrf:
+            ocrf.write(BOM + "First, a line.\nAnd a second line.\n")
+        result = plain_text("ocr.txt")
+        expected = "First, a line.\nAnd a second line."
+        assert result == expected
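
For background on what this test protects against: a UTF-8 file that starts with a byte order mark decodes to text beginning with U+FEFF under the plain `utf-8` codec, while Python's `utf-8-sig` codec drops the mark. The sketch below only demonstrates the codec behavior; whether the project's autodetection selects `utf-8-sig` for such files is not visible in this diff, and `read_with` and `demo` are hypothetical names.

```python
import codecs
import os
import tempfile
from typing import Tuple


def read_with(path: str, encoding: str) -> str:
    """Read the whole file with the given codec."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


def demo() -> Tuple[str, str]:
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ocr.txt")
        with open(path, "wb") as f:
            f.write(codecs.BOM_UTF8 + "First, a line.\n".encode("utf-8"))
        return read_with(path, "utf-8"), read_with(path, "utf-8-sig")


plain, sig = demo()
print(repr(plain[0]), repr(sig[0]))  # '\ufeff' 'F'
```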