mirror of
https://github.com/qurator-spk/eynollah.git
synced 2025-11-17 01:44:14 +01:00
rewrite/simplify manual reading order using recursive algorithm

- rename `return_x_start_end_mothers_childs_and_type_of_reading_order`
  → `return_multicol_separators_x_start_end`, and drop all the analysis
  pertaining to mother/child relationships and full-span separators,
  also drop the separator unification rules;
  instead of the latter, try to combine neighbouring separators more
  generally: join column spans iff there is nothing in between
  (which also necessitates passing the region mask), and keep only
  one of every such redundant pair;
  add the top (of each page part) as full-span separator up front,
  and return separators already ordered by y
- `return_boxes_of_images_by_order_of_reading_new`:
  - also pass regions with separators, so they do not have to be
    reconstructed from the separator coordinates, and also contain
    images and other non-text region types, when trying to elongate
    separators to maximize their span (without introducing overlaps)
  - determine connected components of the region mask, i.e. labels
    and their respective bboxes, in order to
    1. gain additional multi-column separators, if possible
    2. avoid cutting through regions which do cross column boundaries
       later on
  - whenever adding a new bbox, first look up the label map to see if
    there are any multi-column regions extending to the right of the
    current column; if there are, then advance not just one column
    to the right, but as many as necessary to avoid cutting through
    these regions
  - new core algorithm: iterate separators sorted by y and then column
    by column, but whenever the next separator ends in the same column
    as the current one or even further left, recurse (i.e. finish that
    span first before continuing with the top iteration)
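
The label-map lookup described above (advance as many columns as needed so a new bbox does not cut through a multi-column region) could look roughly like the following. This is only an illustrative sketch, not the code from this commit: the function name `next_column_end`, the list-of-lists label map and the `col_bounds` layout are all assumptions, and the test "same non-zero label on both sides of a column boundary" is a simplification of whatever the real implementation checks. In practice the label map would come from a connected-components pass over the region mask.

```python
def next_column_end(labels, col_bounds, top, bottom, col_start):
    """Return the exclusive end column for a box starting at col_start.

    labels     -- 2D list of ints, 0 = background (hypothetical label map)
    col_bounds -- x positions of column boundaries; column i covers
                  x in [col_bounds[i], col_bounds[i + 1])
    top/bottom -- row range [top, bottom) of the box to be added
    """
    n_cols = len(col_bounds) - 1
    col_end = col_start + 1
    while col_end < n_cols:
        x = col_bounds[col_end]  # first pixel right of the boundary
        # A region crosses the boundary if the same non-zero label
        # appears directly left and right of it in some row of the box.
        crossed = any(
            labels[y][x - 1] != 0 and labels[y][x - 1] == labels[y][x]
            for y in range(top, bottom)
        )
        if not crossed:
            break
        col_end += 1  # extend the box by one more column and re-check
    return col_end


# Toy page: 4 rows, 12 pixels wide, three columns of width 4.
labels = [[0] * 12 for _ in range(4)]
for x in range(2, 10):   # region 1 spans all three columns in row 1
    labels[1][x] = 1
for x in range(5, 7):    # region 2 stays inside the middle column
    labels[3][x] = 2
col_bounds = [0, 4, 8, 12]

print(next_column_end(labels, col_bounds, 0, 4, 0))  # 3
print(next_column_end(labels, col_bounds, 2, 4, 0))  # 1
```

In the first call the box contains the row with region 1, so it is widened across both boundaries; in the second call only rows 2 and 3 are considered, nothing crosses, and the box stays one column wide.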
parent 95f76081d1
commit 4abc2ff572
1 changed file with 277 additions and 658 deletions