Compare commits

45 Commits

Author SHA1 Message Date
Mike Gerber f6dfb77f94 🐛 pyproject.toml: Fix description 2 days ago
Mike Gerber ef817cb343 📦 v0.10.0 2 days ago
Mike Gerber b1c109baae
Merge pull request #128 from kba/v3-api
V3 api
2 days ago
Mike Gerber 13ab1ae150 🐛 Docker: Use same vendor as license for now 2 days ago
Mike Gerber d974369e13 🐛 Docker: Fix description 2 days ago
Mike Gerber b7bdca4ac8 🐛 Makefile: Make phony targets .PHONY 2 days ago
kba 831a24fc4c typo: report_prefix -> file_id 3 days ago
Konstantin Baierer f6a2c94520 ocrd_cli: but do check for existing output files
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
3 days ago
Konstantin Baierer 4162836612 ocrd_cli: no need to check fileGrp dir exists
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
3 days ago
Konstantin Baierer c0aa82d188 OCR-D processor: properly handle missing or non-downloaded GT/OCR file
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
3 days ago
kba 8c1b6d65f5 Dockerfile: build ocrd-all-tool.json 3 days ago
Mike Gerber f287386c0e 🧹Don't pin uniseg and rapidfuzz
Breakage with the newest uniseg API was fixed in master.

Can't see any issue with rapidfuzz, so removing that pin, too.
3 days ago
kba 63031b30bf Port to OCR-D/core API v3 3 days ago
Mike Gerber bf6633be02
Merge pull request #136 from qurator-spk/chore/update-liccheck
⚙  liccheck: update permissable licenses (mit-cmu, psf 2.0, iscl)
3 days ago
Mike Gerber d3aa9eb520 ⚙ liccheck: update permissable licenses (mit-cmu, psf 2.0, iscl) 3 days ago
Mike Gerber 625686f204
Merge pull request #135 from qurator-spk/chore/update-python-version
⚙  pyproject.toml: Update supported Python version
3 days ago
Mike Gerber ce7886af23 ⚙ pyproject.toml: Update supported Python version 3 days ago
Mike Gerber a09a624bde
Merge pull request #132 from qurator-spk/fix/uniseg-removed-index-parameter
🐛 Fix for changed API of uniseg's word_break
3 days ago
Mike Gerber badfa9c99e ⚙ GitHub Actions: Don't test on Python 3.8 anymore 3 days ago
Mike Gerber 7f8a8dd564 🐛 Fix for changed API of uniseg's word_break 3 days ago
Mike Gerber b72d4f5af9
Merge pull request #131 from qurator-spk/chore/update-pre-commit
⚙  pre-commit: update
3 days ago
Mike Gerber 058042accb ⚙ pre-commit: update 3 days ago
Mike Gerber 071e6a8bd1
Merge pull request #120 from joschrew/dockerfile
Add Dockerfile and Makefile to create ocr-d dockerimage
6 months ago
Mike Gerber 6b82293670
Update Dockerfile
I fancy-clicked @bertsky's change suggestion, which duplicated some labels. Now fancy-clicking the fix, fingers crossed...
6 months ago
Mike Gerber 6ecf49a355
Update Dockerfile
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
6 months ago
joschrew 9c7c104dce Add Dockerfile and Makefile to create ocr-d image 7 months ago
Mike Gerber 2e6fe0c279
Merge pull request #113 from qurator-spk/python-3.13
✔ Test on Python 3.13
8 months ago
Mike Gerber 1753ed4d13 ✔ Test on Python 3.13 8 months ago
Mike Gerber 3233dbcc8f ✔ pre-commit: Add license check 9 months ago
Mike Gerber f2e290dffe 🐛 Fix --version option in OCR-D CLI 9 months ago
Mike Gerber 6d1daf1dfe Support --version option in CLI 9 months ago
Mike Gerber 27ad145c7e ⚙ pyproject.toml: Add license.file 9 months ago
Mike Gerber 2e9e88cc1e ⚙ pre-commit: Update hooks 9 months ago
Mike Gerber 129e6eb427 📦 v0.9.7 9 months ago
Mike Gerber cf998443c1 ⚙ ruff: Update settings (select → lint.select) 9 months ago
Mike Gerber 6048107889 Merge branch 'master' of https://github.com/qurator-spk/dinglehopper 9 months ago
Mike Gerber 2ee37ed4e3 🎨 Sort imports 9 months ago
Mike Gerber 521f034fba
Merge pull request #116 from stweil/master
Fix typo
9 months ago
Mike Gerber d1a2247615 ⚙ pre-commit: Update hooks 9 months ago
Mike Gerber 4047f8b6e5 🐛 Fix loading ocrd-tool.json for Python 3.12 9 months ago
Stefan Weil cd68a973cb Fix typo
Signed-off-by: Stefan Weil <sw@weilnetz.de>
11 months ago
Mike Gerber bc5818da9f ✔ GitHub Actions: Update used actions 11 months ago
Mike Gerber c91234daba ✔ GitHub Actions: Update used actions 11 months ago
Mike Gerber a534b5e28e ⚙ pre-commit: Update hooks 11 months ago
Mike Gerber b336f98271 🐛 Fix reading plain text files
As reported by @tallemeersch in gh-107, newlines were not removed for plain text files.
Fix this by stripping the lines as suggested.

Fixes gh-107.
12 months ago

@@ -0,0 +1,5 @@
src/dinglehopper/tests
dist
build
*.egg-info
.git

@@ -17,7 +17,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Upgrade pip
run: python3 -m pip install --upgrade pip
- name: Install setuptools
@@ -32,7 +32,7 @@ jobs:
- name: Build package
run: python3 -m pip install --upgrade build && python3 -m build
- name: Upload dist
uses: actions/upload-artifact@v3
uses: actions/upload-artifact@v4
with:
name: dist
path: dist/
@@ -42,7 +42,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Download dist
uses: actions/download-artifact@v3
uses: actions/download-artifact@v4
with:
name: dist
path: dist/
@@ -61,7 +61,7 @@ jobs:
id-token: write # IMPORTANT: this permission is mandatory for trusted publishing
steps:
- name: Download dist
uses: actions/download-artifact@v3
uses: actions/download-artifact@v4
with:
name: dist
path: dist/

@@ -25,18 +25,19 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: [ "3.8", "3.9", "3.10", "3.11", "3.12" ]
python-version: [ "3.9", "3.10", "3.11", "3.12", "3.13" ]
runs-on: "ubuntu-latest"
steps:
- name: Set up Python
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
allow-prereleases: true
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Install possible lxml build requirements (if building from source)
run: sudo apt-get install -y libxml2-dev libxslt-dev python3-dev
@@ -56,7 +57,7 @@ jobs:
cd src
python3 -m pytest --junitxml=../${{matrix.python-version}}-junit.xml -o junit_family=legacy
- name: Upload test results
uses: actions/upload-artifact@v3
uses: actions/upload-artifact@v4
if: success() || failure()
with:
name: test-results-${{matrix.python-version}}

@@ -12,7 +12,7 @@ jobs:
report:
runs-on: ubuntu-latest
steps:
- uses: dorny/test-reporter@v1.7.0
- uses: dorny/test-reporter@v1
with:
artifact: /test-results-(.*)/
name: 'Tests Results - $1'

@@ -1,6 +1,6 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.6.0
rev: v5.0.0
hooks:
- id: trailing-whitespace
- id: end-of-file-fixer
@@ -11,12 +11,12 @@ repos:
- id: check-ast
- repo: https://github.com/psf/black
rev: 24.4.2
rev: 25.1.0
hooks:
- id: black
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.4.3
rev: v0.11.5
hooks:
- args:
- --fix
@@ -24,7 +24,7 @@ repos:
id: ruff
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.10.0
rev: v1.15.0
hooks:
- additional_dependencies:
- types-setuptools
@@ -36,6 +36,12 @@ repos:
id: mypy
- repo: https://gitlab.com/vojko.pribudic.foss/pre-commit-update
rev: v0.3.1post2
rev: v0.6.1
hooks:
- id: pre-commit-update
- repo: https://github.com/dhatim/python-license-check
rev: 0.9.2
hooks:
- id: liccheck
language: system

@@ -0,0 +1,38 @@
ARG DOCKER_BASE_IMAGE
FROM $DOCKER_BASE_IMAGE
ARG VCS_REF
ARG BUILD_DATE
LABEL \
maintainer="https://github.com/qurator-spk/dinglehopper/issues" \
org.label-schema.vcs-ref=$VCS_REF \
org.label-schema.vcs-url="https://github.com/qurator-spk/dinglehopper" \
org.label-schema.build-date=$BUILD_DATE \
org.opencontainers.image.vendor="qurator" \
org.opencontainers.image.title="dinglehopper" \
org.opencontainers.image.description="An OCR evaluation tool" \
org.opencontainers.image.source="https://github.com/qurator-spk/dinglehopper" \
org.opencontainers.image.documentation="https://github.com/qurator-spk/dinglehopper/blob/${VCS_REF}/README.md" \
org.opencontainers.image.revision=$VCS_REF \
org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.base.name=ocrd/core
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8
# avoid HOME/.local/share (hard to predict USER here)
# so let XDG_DATA_HOME coincide with fixed system location
# (can still be overridden by derived stages)
ENV XDG_DATA_HOME /usr/local/share
# avoid the need for an extra volume for persistent resource user db
# (i.e. XDG_CONFIG_HOME/ocrd/resources.yml)
ENV XDG_CONFIG_HOME /usr/local/share/ocrd-resources
WORKDIR /build/dinglehopper
COPY . .
COPY ocrd-tool.json .
# prepackage ocrd-tool.json as ocrd-all-tool.json
RUN ocrd ocrd-tool ocrd-tool.json dump-tools > $(dirname $(ocrd bashlib filename))/ocrd-all-tool.json
RUN make install && rm -rf /build/dinglehopper
WORKDIR /data
VOLUME /data

@@ -0,0 +1,33 @@
PYTHON = python3
PIP = pip3
PYTHONIOENCODING=utf8
PYTEST_ARGS = -vv
DOCKER_BASE_IMAGE = docker.io/ocrd/core:v3.3.0
DOCKER_TAG = ocrd/dinglehopper
help:
@echo
@echo " Targets"
@echo
@echo " install Install full Python package via pip"
@echo " docker Build the ocrd/dinglehopper docker image"
# Install Python package via pip
install:
$(PIP) install .
install-dev:
$(PIP) install -e .
test:
pytest $(PYTEST_ARGS)
docker:
docker build \
--build-arg DOCKER_BASE_IMAGE=$(DOCKER_BASE_IMAGE) \
--build-arg VCS_REF=$$(git rev-parse --short HEAD) \
--build-arg BUILD_DATE=$$(date -u +"%Y-%m-%dT%H:%M:%SZ") \
-t $(DOCKER_TAG) .
.PHONY: help install install-dev test docker

@@ -7,9 +7,10 @@ authors = [
{name = "Mike Gerber", email = "mike.gerber@sbb.spk-berlin.de"},
{name = "The QURATOR SPK Team", email = "qurator@sbb.spk-berlin.de"},
]
description = "The OCR evaluation tool"
description = "An OCR evaluation tool"
readme = "README.md"
requires-python = ">=3.8"
license.file = "LICENSE"
requires-python = ">=3.9"
keywords = ["qurator", "ocr", "evaluation", "ocr-d"]
dynamic = ["version", "dependencies", "optional-dependencies"]
@@ -48,7 +49,7 @@ optional-dependencies.dev = {file = ["requirements-dev.txt"]}
where = ["src"]
[tool.setuptools.package-data]
dinglehopper = ["templates/*"]
dinglehopper = ["templates/*", "*.json"]
[tool.pytest.ini_options]
@@ -74,5 +75,40 @@ disallow_untyped_defs = false
disallow_untyped_calls = false
[tool.ruff]
[tool.ruff.lint]
select = ["E", "F", "I"]
[tool.liccheck]
authorized_licenses = [
"bsd",
"new bsd",
"bsd license",
"new bsd license",
"simplified bsd",
"apache",
"apache 2.0",
"apache software license",
"apache software",
"apache license 2.0",
"gnu lgpl",
"lgpl with exceptions or zpl",
"GNU Library or Lesser General Public License (LGPL)",
"GNU Lesser General Public License v3 (LGPLv3)",
"GNU Lesser General Public License v2 or later (LGPLv2+)",
"mit",
"mit license",
"mit-cmu",
"python software foundation",
"psf",
"psf-2.0",
"Historical Permission Notice and Disclaimer (HPND)",
"public domain",
'The Unlicense (Unlicense)',
"isc",
"ISC License (ISCL)",
'Mozilla Public License 2.0 (MPL 2.0)',
]
unauthorized_licenses = [
"gpl v3",
]

@@ -10,3 +10,5 @@ mypy
types-lxml
types-setuptools
pytest-mypy
liccheck

@@ -1,13 +1,14 @@
click
jinja2
lxml
uniseg >= 0.8.0
uniseg >= 0.9.1
numpy
colorama
MarkupSafe
ocrd >= 2.65.0
ocrd >= 3.3.0
attrs
multimethod >= 1.3
tqdm
rapidfuzz >= 2.7.0
chardet
importlib_resources

@@ -234,6 +234,7 @@ def process_dir(
metavar="LEVEL",
)
@click.option("--progress", default=False, is_flag=True, help="Show progress bar")
@click.version_option()
def main(
gt,
ocr,

@@ -149,7 +149,7 @@ class ExtractedText:
raise ValueError("Can't have joiner without segments to join")
if self.segments is not None:
if value not in ("", " ", "\n"):
raise ValueError(f"Unexcepted segment joiner value {repr(value)}")
raise ValueError(f"Unexpected segment joiner value {repr(value)}")
@_text.validator
def is_valid_text(self, _, value):

@@ -36,7 +36,7 @@ def alto_extract_lines(tree: ET._ElementTree) -> Iterator[ExtractedText]:
for line in tree.iterfind(".//alto:TextLine", namespaces=nsmap):
line_id = line.attrib.get("ID")
line_text = " ".join(
string.attrib.get("CONTENT")
string.attrib.get("CONTENT", "")
for string in line.iterfind("alto:String", namespaces=nsmap)
)
normalized_text = normalize_sbb(line_text)
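Note on the hunk above: `str.join` raises `TypeError` as soon as one item is `None`, which is what a missing `CONTENT` attribute produces without a default. A minimal sketch of the difference, using plain dicts as stand-ins for the `attrib` mappings of ALTO `String` elements (the helper `join_content` and the sample data are assumptions for illustration, not code from this repository):

```python
# Plain dicts stand in for element.attrib of ALTO String elements (assumption).
strings = [{"CONTENT": "Hello"}, {}, {"CONTENT": "world"}]


def join_content(attribs, default=None):
    """Join the CONTENT values with spaces, as the extraction code does."""
    return " ".join(a.get("CONTENT", default) for a in attribs)


# Without a default, the missing attribute yields None and join() fails.
try:
    join_content(strings)
    failed_without_default = False
except TypeError:
    failed_without_default = True

# With "" as the default, the line is still produced.
line_text = join_content(strings, default="")
```

The empty string keeps its slot in the join, so a `String` without `CONTENT` leaves a doubled space in `line_text`; whether later normalization collapses it is not visible in this hunk.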
@@ -167,7 +167,7 @@ def plain_extract(filename, include_filename_in_id=False):
with open(filename, "r", encoding=fileencoding) as f:
return ExtractedText(
None,
[make_segment(no, line) for no, line in enumerate(f.readlines())],
[make_segment(no, line.strip()) for no, line in enumerate(f.readlines())],
"\n",
None,
None,
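Note on the hunk above: `readlines()` keeps each line's trailing newline, so joining the segments with `"\n"` doubled the line breaks before this fix. A small self-contained sketch of the effect, reusing the sample text from the updated test (the helper `read_lines` is hypothetical and only mirrors the list comprehension in the diff):

```python
import os
import tempfile


def read_lines(path, strip):
    """Read all lines, optionally stripping surrounding whitespace."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() if strip else line for line in f.readlines()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ocr.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("First, a line.\nAnd a second line.\n")
    raw = read_lines(path, strip=False)
    stripped = read_lines(path, strip=True)

# Segments are joined with "\n", as in the ExtractedText joiner above.
joined_raw = "\n".join(raw)
joined_stripped = "\n".join(stripped)
```

Keep in mind that `strip()` removes all leading and trailing whitespace, not only the newline, so indentation in plain text files is dropped as well.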

@@ -1,17 +1,13 @@
{
"version": "0.9.6",
"version": "0.10.0",
"git_url": "https://github.com/qurator-spk/dinglehopper",
"dockerhub": "ocrd/dinglehopper",
"tools": {
"ocrd-dinglehopper": {
"executable": "ocrd-dinglehopper",
"input_file_grp_cardinality": 2,
"output_file_grp_cardinality": 1,
"description": "Evaluate OCR text against ground truth with dinglehopper",
"input_file_grp": [
"OCR-D-GT-PAGE",
"OCR-D-OCR"
],
"output_file_grp": [
"OCR-D-OCR-EVAL"
],
"categories": [
"Quality assurance"
],

@@ -1,78 +1,76 @@
import json
from functools import cached_property
import os
from typing import Optional
import click
from ocrd_models import OcrdFileType
from ocrd import Processor
from ocrd.decorators import ocrd_cli_options, ocrd_cli_wrap_processor
from ocrd_utils import assert_file_grp_cardinality, getLogger, make_file_id
from pkg_resources import resource_string
from ocrd_utils import make_file_id
from .cli import process as cli_process
OCRD_TOOL = json.loads(resource_string(__name__, "ocrd-tool.json").decode("utf8"))
@click.command()
@ocrd_cli_options
def ocrd_dinglehopper(*args, **kwargs):
return ocrd_cli_wrap_processor(OcrdDinglehopperEvaluate, *args, **kwargs)
class OcrdDinglehopperEvaluate(Processor):
def __init__(self, *args, **kwargs):
kwargs["ocrd_tool"] = OCRD_TOOL["tools"]["ocrd-dinglehopper"]
super(OcrdDinglehopperEvaluate, self).__init__(*args, **kwargs)
def process(self):
assert_file_grp_cardinality(self.input_file_grp, 2, "GT and OCR")
assert_file_grp_cardinality(self.output_file_grp, 1)
@cached_property
def executable(self):
return 'ocrd-dinglehopper'
log = getLogger("processor.OcrdDinglehopperEvaluate")
def process_page_file(self, *input_files: Optional[OcrdFileType]) -> None:
assert self.parameter
metrics = self.parameter["metrics"]
textequiv_level = self.parameter["textequiv_level"]
gt_grp, ocr_grp = self.input_file_grp.split(",")
input_file_tuples = self.zip_input_files(on_error="abort")
for n, (gt_file, ocr_file) in enumerate(input_file_tuples):
if not gt_file or not ocr_file:
# file/page was not found in this group
continue
gt_file = self.workspace.download_file(gt_file)
ocr_file = self.workspace.download_file(ocr_file)
page_id = gt_file.pageId
log.info("INPUT FILES %i / %s%s", n, gt_file, ocr_file)
file_id = make_file_id(ocr_file, self.output_file_grp)
report_prefix = os.path.join(self.output_file_grp, file_id)
# Process the files
try:
os.mkdir(self.output_file_grp)
except FileExistsError:
pass
cli_process(
gt_file.local_filename,
ocr_file.local_filename,
report_prefix,
metrics=metrics,
textequiv_level=textequiv_level,
# wrong number of inputs: let fail
gt_file, ocr_file = input_files
# missing on either side: skip (zip_input_files already warned)
if not gt_file or not ocr_file:
return
# missing download (i.e. OCRD_DOWNLOAD_INPUT=false):
if not gt_file.local_filename:
if config.OCRD_MISSING_INPUT == 'ABORT':
raise MissingInputFile(gt_file.fileGrp, gt_file.pageId, gt_file.mimetype)
return
if not ocr_file.local_filename:
if config.OCRD_MISSING_INPUT == 'ABORT':
raise MissingInputFile(ocr_file.fileGrp, ocr_file.pageId, ocr_file.mimetype)
return
page_id = gt_file.pageId
file_id = make_file_id(ocr_file, self.output_file_grp)
cli_process(
gt_file.local_filename,
ocr_file.local_filename,
file_id,
self.output_file_grp,
metrics=metrics,
textequiv_level=textequiv_level,
)
# Add reports to the workspace
for report_suffix, mimetype in [
[".html", "text/html"],
[".json", "application/json"],
]:
output_file_id = file_id + report_suffix
output_file = next(self.workspace.mets.find_files(ID=output_file_id), None)
if output_file and config.OCRD_EXISTING_OUTPUT != 'OVERWRITE':
raise FileExistsError(f"A file with ID=={output_file_id} already exists {output_file} and neither force nor ignore are set")
self.workspace.add_file(
file_id=output_file_id,
file_grp=self.output_file_grp,
page_id=page_id,
mimetype=mimetype,
local_filename=file_id + report_suffix,
)
# Add reports to the workspace
for report_suffix, mimetype in [
[".html", "text/html"],
[".json", "application/json"],
]:
self.workspace.add_file(
file_id=file_id + report_suffix,
file_grp=self.output_file_grp,
page_id=page_id,
mimetype=mimetype,
local_filename=report_prefix + report_suffix,
)
if __name__ == "__main__":
ocrd_dinglehopper()
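Note on the hunk above: the new `process_page_file` first skips pairs where either side is missing, then treats a file without a local copy according to a missing-input policy. The decision logic can be sketched on its own as follows. Everything here is a stand-in for illustration: `InputFile`, `MissingInput`, `should_process` and the `"SKIP"` policy value are hypothetical names, not the OCR-D API, and the sketch uses `is None` where the diff uses a truthiness check.

```python
from dataclasses import dataclass
from typing import Optional


class MissingInput(Exception):
    """Hypothetical stand-in for the exception raised in the diff."""


@dataclass
class InputFile:
    """Hypothetical stand-in for a workspace file entry."""

    page_id: str
    local_filename: Optional[str]


def should_process(gt, ocr, missing_input_policy):
    """Return True if the GT/OCR pair can be evaluated, False to skip it."""
    # Missing on either side: skip the page.
    if gt is None or ocr is None:
        return False
    # Present in the workspace but not downloaded: abort or skip by policy.
    for f in (gt, ocr):
        if not f.local_filename:
            if missing_input_policy == "ABORT":
                raise MissingInput(f.page_id)
            return False
    return True


ok = should_process(InputFile("p1", "gt.xml"), InputFile("p1", "ocr.xml"), "SKIP")
skipped = should_process(InputFile("p1", "gt.xml"), InputFile("p1", None), "SKIP")
```

With the `"ABORT"` policy the second call would raise `MissingInput` instead of returning `False`.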

@@ -177,8 +177,8 @@ def test_text():
def test_plain(tmp_path):
with working_directory(tmp_path):
with open("ocr.txt", "w") as ocrf:
ocrf.write("AAAAB")
ocrf.write("First, a line.\nAnd a second line.\n")
result = plain_text("ocr.txt")
expected = "AAAAB"
expected = "First, a line.\nAnd a second line."
assert result == expected

@@ -22,11 +22,11 @@ def patch_word_break():
"""
old_word_break = uniseg.wordbreak.word_break
def new_word_break(c, index=0):
def new_word_break(c):
if 0xE000 <= ord(c) <= 0xF8FF: # Private Use Area
return uniseg.wordbreak.WordBreak.ALETTER
return uniseg.wordbreak.Word_Break.ALetter
else:
return old_word_break(c, index)
return old_word_break(c)
uniseg.wordbreak.word_break = new_word_break
global word_break_patched
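Note on the hunk above: the patch wraps the original lookup so that Private Use Area code points (U+E000 to U+F8FF) are classified as letters, and delegates everything else to the wrapped function. A library-free sketch of that wrapping pattern, where `classify` is a hypothetical stand-in for the word-break lookup and the string labels are placeholders rather than uniseg's enum values:

```python
def classify(c):
    """Hypothetical stand-in for a word-break property lookup."""
    return "ALetter" if c.isalpha() else "Other"


def with_private_use_letters(old_lookup):
    """Wrap a lookup so Private Use Area characters count as letters."""

    def new_lookup(c):
        if 0xE000 <= ord(c) <= 0xF8FF:  # Private Use Area
            return "ALetter"
        return old_lookup(c)

    return new_lookup


patched = with_private_use_letters(classify)

before = classify("\ue000")  # the stand-in does not treat PUA as a letter
after = patched("\ue000")  # the wrapper overrides the classification
```

The single-argument signature matches the direction of the fix in the diff, which drops the `index` parameter from both the wrapper and the delegated call.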
