Robert Sachunsky 0dc5bdac2e generalize to other textLineOrder/readingDirection:
- don't ignore regions / lines / words that are not top-to-bottom and left-to-right;
  instead, only ignore regions that are not top-to-bottom OR bottom-to-top and
  lines or words that are not left-to-right OR right-to-left
  (thus, applying each on its appropriate level, and allowing reverse sorting,
   but still discounting rotated layouts)
- don't enter segments if they have no more than 1 child
- improve logging: show failed attempts on debug, show pageIds throughout
2019-11-29 12:31:08 +01:00

ocrd_repair_inconsistencies

Automatically re-order lines, words and glyphs to become textually consistent with their parents.

PAGE-XML elements with textual annotation are re-ordered by their centroid coordinates in top-to-bottom/left-to-right fashion if and only if such re-ordering fixes the inconsistency between their appropriately concatenated TextEquiv texts and their parent's TextEquiv text.

This processor does not affect ReadingOrder between regions, just the order of the XML elements below the region level, and only if not contradicting the annotated textLineOrder/readingDirection.
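
The re-ordering rule described above can be illustrated with a minimal Python sketch. This is not the processor's actual code: the `Segment` class, the `repair_order` function, the two lookup tables, and the centroid computed as the plain mean of the polygon points are all assumptions made for illustration. The joiners are likewise assumed (newline between lines, space between words, nothing between glyphs).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Segment:
    """Hypothetical stand-in for a PAGE-XML region, line, word or glyph."""
    text: str
    points: List[Tuple[float, float]] = field(default_factory=list)
    children: List["Segment"] = field(default_factory=list)


# Assumed mapping: direction -> (coordinate axis to sort on, reverse sort?).
# Lines inside a region follow textLineOrder; words and glyphs follow
# readingDirection. Any other value (a rotated layout) is not handled.
LINE_ORDERS: Dict[str, Tuple[int, bool]] = {
    "top-to-bottom": (1, False),
    "bottom-to-top": (1, True),
}
WORD_ORDERS: Dict[str, Tuple[int, bool]] = {
    "left-to-right": (0, False),
    "right-to-left": (0, True),
}


def centroid(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Mean of the polygon points (a simplification of a true centroid)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def repair_order(parent: Segment, joiner: str, direction: str,
                 orders: Dict[str, Tuple[int, bool]]) -> bool:
    """Re-order parent.children only if that makes the texts consistent.

    Returns True if the children were re-ordered, False otherwise.
    """
    if len(parent.children) <= 1:
        return False  # nothing to re-order
    if direction not in orders:
        return False  # unsupported direction on this level: leave untouched
    if joiner.join(c.text for c in parent.children) == parent.text:
        return False  # already consistent
    axis, reverse = orders[direction]
    candidate = sorted(parent.children,
                       key=lambda c: centroid(c.points)[axis],
                       reverse=reverse)
    if joiner.join(c.text for c in candidate) == parent.text:
        parent.children = candidate  # sorting fixed the inconsistency
        return True
    return False  # sorting did not help: keep the original order
```

For a region whose two lines are stored in the wrong order, `repair_order(region, "\n", "top-to-bottom", LINE_ORDERS)` would swap them. If sorting does not reproduce the parent text, the original order is kept.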

We wrote this as a one-shot script to fix some files. Use with caution.

Example usage

The following script runs the processor on the OCR-D-GT-PAGE file group and copies the repaired files back over the originals:

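#!/bin/bash
set -e

# Temporary output file group for the repaired PAGE files
tmp_fg="FIXED_$RANDOM"

# Write repaired copies of OCR-D-GT-PAGE into the temporary group
ocrd-repair-inconsistencies -I OCR-D-GT-PAGE -O "$tmp_fg"

# Copy each repaired file over its original, swapping the group prefix in the file name
for f in "$tmp_fg"/*; do
  g="OCR-D-GT-PAGE/OCR-D-GT-PAGE_${f#${tmp_fg}/${tmp_fg}_}"
  cp "$f" "$g"
done

# Remove the temporary group from the workspace, then delete its directory
ocrd workspace remove-group -rf "$tmp_fg"
rmdir "$tmp_fg"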