You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
67 lines
2.1 KiB
Markdown
67 lines
2.1 KiB
Markdown
|
|
> [!CAUTION]
|
|
> This was a one-off script, useful to solve a specific problem. We do not maintain it anymore, but in case you want to use it, we appreciate an e-mail to [mike.gerber@sbb.spk-berlin.de](mailto:mike.gerber@sbb.spk.berlin.de)
|
|
> 🕸
|
|
|
|
# ocrd_repair_inconsistencies
|
|
|
|
> Automatically re-order lines, words and glyphs to become textually consistent with their parents.
|
|
|
|
## Introduction
|
|
|
|
PAGE-XML elements with textual annotation are re-ordered by their centroid coordinates
|
|
iff such re-ordering fixes the inconsistency between their appropriately concatenated
|
|
`TextEquiv` texts with their parent's `TextEquiv` text.
|
|
|
|
If `TextEquiv` is missing, skip the respective elements.
|
|
|
|
Where available, respect the annotated visual order:
|
|
- For regions vs lines, sort in `top-to-bottom` fashion, unless another `textLineOrder` is annotated.
|
|
(Both `left-to-right` and `right-to-left` will be skipped currently.)
|
|
- For lines vs words and words vs glyphs, sort in `left-to-right` fashion, unless another `readingDirection` is annotated.
|
|
(Both `top-to-bottom` and `bottom-to-top` will be skipped currently.)
|
|
|
|
This processor does not affect `ReadingOrder` between regions, just the order of the XML elements
|
|
below the region level, and only if not contradicting the annotated `textLineOrder`/`readingDirection`.
|
|
|
|
We wrote this as a one-shot script to fix some files. Use with caution.
|
|
|
|
## Installation
|
|
|
|
(In your venv, run:)
|
|
|
|
```sh
|
|
make deps # or pip install -r requirements.txt
|
|
make install # or pip install .
|
|
```
|
|
|
|
## Usage
|
|
|
|
Offers the following user interfaces:
|
|
|
|
### [OCR-D processor](https://ocr-d.github.io/cli) CLI `ocrd-repair-inconsistencies`
|
|
|
|
To be used with [PageXML](https://github.com/PRImA-Research-Lab/PAGE-XML)
|
|
documents in an [OCR-D](https://ocr-d.github.io) annotation workflow.
|
|
|
|
### Example
|
|
|
|
Use the following script to repair `OCR-D-GT-PAGE` annotation in workspaces,
|
|
and then replace it with the output on success:
|
|
|
|
~~~sh
|
|
#!/bin/bash
|
|
set -e
|
|
|
|
tmp_fg=FIXED_$RANDOM
|
|
|
|
ocrd-repair-inconsistencies -I OCR-D-GT-PAGE -O $tmp_fg
|
|
|
|
for f in "$tmp_fg"/*; do
|
|
g="OCR-D-GT-PAGE/OCR-D-GT-PAGE_${f#${tmp_fg}/${tmp_fg}_}"
|
|
cp "$f" "$g"
|
|
done
|
|
|
|
ocrd workspace remove-group -rf $tmp_fg
|
|
~~~
|