Latest commit 0dc5bdac2e (5 years ago):
- don't ignore regions / lines / words that are not top-to-bottom and left-to-right; instead, only ignore regions that are not top-to-bottom OR bottom-to-top, and lines or words that are not left-to-right OR right-to-left (thus applying each on its appropriate level, and allowing reverse sorting, but still discounting rotated layouts)
- don't enter segments if they have no more than 1 child
- improve logging: show failed attempts on debug, show pageIds throughout
ocrd_repair_inconsistencies
Automatically re-order lines, words and glyphs to become textually consistent with their parents.
PAGE-XML elements with textual annotation are re-ordered by their centroid coordinates in top-to-bottom/left-to-right fashion if and only if such re-ordering fixes the inconsistency between their appropriately concatenated TextEquiv texts and their parent's TextEquiv text.
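The rule above can be sketched in a few lines of Python. This is only an illustration, not the processor's actual code: the names `centroid` and `repair_order` are hypothetical, the centroid is simplified to the mean of the polygon vertices, and the separators (newline between lines, space between words, nothing between glyphs) are an assumption about what "appropriately concatenated" means.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Segment = Tuple[str, Sequence[Point]]  # (TextEquiv text, polygon points)


def centroid(points: Sequence[Point]) -> Point:
    """Mean of the polygon vertices (a simplification, assumed here)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def repair_order(parent_text: str, children: List[Segment],
                 separator: str, axis: int) -> Optional[List[Segment]]:
    """Return the children sorted by centroid along `axis` (0 = x, 1 = y),
    but only if that sorted order makes the joined text equal the parent's.
    Return None if nothing needs fixing or sorting does not help."""
    if len(children) < 2:
        return None  # nothing to re-order
    if separator.join(text for text, _ in children) == parent_text:
        return None  # already consistent
    candidate = sorted(children, key=lambda seg: centroid(seg[1])[axis])
    if separator.join(text for text, _ in candidate) == parent_text:
        return candidate  # re-ordering fixes the inconsistency
    return None  # re-ordering would not help, so leave the order alone


# Two words stored in the wrong XML order inside a line reading "hello world"
words = [
    ("world", [(60, 0), (100, 0), (100, 10), (60, 10)]),
    ("hello", [(0, 0), (50, 0), (50, 10), (0, 10)]),
]
fixed = repair_order("hello world", words, " ", axis=0)
print([text for text, _ in fixed])  # ['hello', 'world']
```

The important design point is the final check: the sorted order is accepted only when it reproduces the parent's text exactly, so elements whose text disagrees with the parent for other reasons are left untouched.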
This processor does not affect ReadingOrder between regions, just the order of the XML elements below the region level, and only if not contradicting the annotated textLineOrder/readingDirection.
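One possible reading of this guard, sketched in Python. The function name `sort_plan` and its return convention are hypothetical; the attribute values are the ones defined by PAGE-XML. The sketch assumes that lines within a region are sorted vertically according to textLineOrder, that words and glyphs are sorted horizontally according to readingDirection, that the opposite direction on the same axis means reverse sorting, and that any other annotation (such as a rotated layout) means the segment is skipped.

```python
from typing import Optional, Tuple


def sort_plan(level: str,
              text_line_order: str = "top-to-bottom",
              reading_direction: str = "left-to-right"
              ) -> Optional[Tuple[str, bool]]:
    """Return (axis, reverse) for sorting the children of a segment,
    or None if the annotated direction rules out re-ordering.

    level: "region" (children are lines), "line" (children are words)
           or "word" (children are glyphs).
    """
    if level == "region":
        # lines are stacked vertically; only vertical orders are handled
        if text_line_order == "top-to-bottom":
            return ("y", False)
        if text_line_order == "bottom-to-top":
            return ("y", True)
        return None  # e.g. left-to-right line order: rotated layout, skip
    if level in ("line", "word"):
        # words and glyphs run horizontally; only horizontal directions
        if reading_direction == "left-to-right":
            return ("x", False)
        if reading_direction == "right-to-left":
            return ("x", True)
        return None  # vertical reading direction: skip
    raise ValueError(f"unknown level: {level}")


print(sort_plan("region", text_line_order="bottom-to-top"))   # ('y', True)
print(sort_plan("line", reading_direction="top-to-bottom"))   # None
```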
We wrote this as a one-shot script to fix some files. Use with caution.
Example usage
For example, use this fix script:
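#!/bin/bash
set -e
# Write the repaired files to a temporary file group first
tmp_fg=FIXED_$RANDOM
ocrd-repair-inconsistencies -I OCR-D-GT-PAGE -O "$tmp_fg"
# Copy each repaired file back over its original in OCR-D-GT-PAGE
for f in "$tmp_fg"/*; do
    g="OCR-D-GT-PAGE/OCR-D-GT-PAGE_${f#${tmp_fg}/${tmp_fg}_}"
    cp "$f" "$g"
done
# Remove the temporary file group from the workspace, then its directory
ocrd workspace remove-group -rf "$tmp_fg"
rmdir "$tmp_fg"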