[!CAUTION] This was a one-off script, useful to solve a specific problem. We do not maintain it anymore, but in case you want to use it, we appreciate an e-mail to mike.gerber@sbb.spk-berlin.de 🕸

ocrd_repair_inconsistencies

Automatically re-order lines, words and glyphs to become textually consistent with their parents.

Introduction

PAGE-XML elements with textual annotation are re-ordered by their centroid coordinates if and only if such re-ordering fixes the inconsistency between their appropriately concatenated TextEquiv texts and their parent's TextEquiv text.

If TextEquiv is missing, skip the respective elements.
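
The rule above can be sketched in a few lines of Python. This is only an illustration, not the processor's actual code: the `Element` class, the `repair_order` function, the vertex-mean approximation of the centroid and the `joiner` parameter are all assumptions made for this example (for instance, joining words with a space to compare against their line's text).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) in image coordinates


@dataclass
class Element:
    """Hypothetical stand-in for a PAGE-XML line, word or glyph."""
    text: Optional[str]   # TextEquiv text, None if missing
    points: List[Point]   # polygon outline


def centroid(points: List[Point]) -> Point:
    """Mean of the polygon vertices (a simple centroid approximation)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def repair_order(parent_text: Optional[str], children: List[Element],
                 joiner: str, axis: int) -> List[Element]:
    """Return the children sorted by centroid along `axis` (0 = x, 1 = y),
    but only if that makes the concatenated text equal the parent's text.
    Otherwise return the children in their original order."""
    # Missing TextEquiv anywhere: skip, as described above.
    if parent_text is None or any(c.text is None for c in children):
        return children

    def concat(elems: List[Element]) -> str:
        return joiner.join(e.text for e in elems)

    if concat(children) == parent_text:
        return children  # already consistent, nothing to fix

    candidate = sorted(children, key=lambda c: centroid(c.points)[axis])
    # Accept the new order if and only if it fixes the inconsistency.
    return candidate if concat(candidate) == parent_text else children
```

With two words stored in the wrong order, `repair_order("hello world", words, " ", 0)` would return them sorted left to right, because the sorted concatenation then matches the line's text.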

Where available, respect the annotated visual order:

  • For regions vs lines, sort in top-to-bottom fashion, unless another textLineOrder is annotated.
    (Both left-to-right and right-to-left are currently skipped.)
  • For lines vs words and words vs glyphs, sort in left-to-right fashion, unless another readingDirection is annotated.
    (Both top-to-bottom and bottom-to-top are currently skipped.)
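
As a rough sketch of how the annotated order could select the sort key, assuming centroids are `(x, y)` tuples in image coordinates (y grows downwards): the function names below are hypothetical, and treating `bottom-to-top` and `right-to-left` as simply reversed sorts is an assumption drawn from the list above, not confirmed behaviour of the processor.

```python
from typing import Callable, Optional, Tuple

Centroid = Tuple[float, float]  # (x, y) in image coordinates
SortKey = Callable[[Centroid], float]


def sort_key_for_lines(text_line_order: Optional[str]) -> Optional[SortKey]:
    """Sort key for lines within a region, or None if the case is skipped."""
    order = text_line_order or "top-to-bottom"  # default when not annotated
    if order == "top-to-bottom":
        return lambda c: c[1]
    if order == "bottom-to-top":
        return lambda c: -c[1]
    return None  # left-to-right / right-to-left: skipped


def sort_key_for_inline(reading_direction: Optional[str]) -> Optional[SortKey]:
    """Sort key for words within a line or glyphs within a word,
    or None if the case is skipped."""
    direction = reading_direction or "left-to-right"  # default when not annotated
    if direction == "left-to-right":
        return lambda c: c[0]
    if direction == "right-to-left":
        return lambda c: -c[0]
    return None  # top-to-bottom / bottom-to-top: skipped
```

A caller would leave the elements untouched whenever the returned key is `None`.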

This processor does not affect ReadingOrder between regions, just the order of the XML elements below the region level, and only if this does not contradict the annotated textLineOrder/readingDirection.

We wrote this as a one-shot script to fix some files. Use with caution.

Installation

(In your venv, run:)

make deps     # or pip install -r requirements.txt
make install  # or pip install .

Usage

This package offers the following user interface:

OCR-D processor CLI ocrd-repair-inconsistencies

To be used with PAGE-XML documents in an OCR-D annotation workflow.

Example

Use the following script to repair the OCR-D-GT-PAGE annotation in a workspace and, on success, replace it with the repaired output:

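#!/bin/bash
# Abort on the first failing command, so the originals are only replaced on success
set -e

# Temporary output file group with a random suffix
tmp_fg="FIXED_$RANDOM"

ocrd-repair-inconsistencies -I OCR-D-GT-PAGE -O "$tmp_fg"

# Copy each repaired file over its OCR-D-GT-PAGE counterpart
for f in "$tmp_fg"/*; do
  g="OCR-D-GT-PAGE/OCR-D-GT-PAGE_${f#${tmp_fg}/${tmp_fg}_}"
  cp "$f" "$g"
done

# Remove the temporary file group from the workspace
ocrd workspace remove-group -rf "$tmp_fg"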
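
The parameter expansion `${f#${tmp_fg}/${tmp_fg}_}` in the loop strips the temporary group's directory and file name prefix, so that each repaired file can be mapped back to its original. The same mapping, written out in Python purely for illustration (the function name `target_path` and the example file name are hypothetical):

```python
def target_path(src: str, tmp_group: str, gt_group: str = "OCR-D-GT-PAGE") -> str:
    """Map a file in the temporary group to its counterpart in the GT group."""
    prefix = f"{tmp_group}/{tmp_group}_"
    # Like ${f#prefix}: drop the prefix if present, otherwise keep the string as is.
    suffix = src[len(prefix):] if src.startswith(prefix) else src
    return f"{gt_group}/{gt_group}_{suffix}"


# Hypothetical file name, assuming files are named <group>_<id>.xml
print(target_path("FIXED_42/FIXED_42_0001.xml", "FIXED_42"))
```

This relies on the output files following the `<file group>_<id>` naming pattern, which the script above also assumes.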