- rename `return_x_start_end_mothers_childs_and_type_of_reading_order`
→ `return_multicol_separators_x_start_end`, and drop all the analysis
pertaining to mother/child relationships and full-span separators,
also drop the separator unification rules;
instead of the latter, try to combine neighbouring separators more
generally: join column spans iff there is nothing in between
(which also necessitates passing the region mask), and keep only
one of every such redundant pair;
add the top (of each page part) as full-span separator up front,
and return separators already ordered by y
- `return_boxes_of_images_by_order_of_reading_new`:
  - also pass regions with separators, so they do not have to be
    reconstructed from the separator coordinates, and also contain
    images and other non-text region types, when trying to elongate
    separators to maximize their span (without introducing overlaps)
  - determine connected components of the region mask, i.e. labels
    and their respective bboxes, in order to
    1. gain additional multi-column separators, if possible
    2. avoid cutting through regions which do cross column boundaries
       later on
  - whenever adding a new bbox, first look up the label map to see if
    there are any multi-column regions extending to the right of the
    current column; if there are, then advance not just one column
    to the right, but as many as necessary to avoid cutting through
    these regions
  - new core algorithm: iterate separators sorted by y and then column
    by column, but whenever the next separator ends in the same column
    as the current one or even further left, recurse (i.e. finish that
    span first before continuing with the top iteration)
- `lines` → `seps` (to distinguish from textlines)
- `text_regions_p_1_n` → `text_regions_p_d` (because all other
deskewed variables are called like this)
- `pixel` → `label`
- drop connected components analysis to test overlaps between
horizontal separators and (horizontal) neighbours (introduced
in ab17a927)
- instead of converting headings to topline and baseline during
`find_number_of_columns_in_document` (introduced in 9f1595d7),
add them to the matrix unchanged, but mark them as an extra type
(besides horizontal and vertical separators)
- convert headings to toplines and baselines no earlier than in
`return_boxes_of_images_by_order_of_reading_new`
- for both headings and horizontal separators, if they already
span multiple columns, check if they would overlap (horizontal)
neighbours by looking at successively larger (left and right)
intervals of columns (and pick the largest elongation which
does not introduce any overlaps)
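
The elongation check described in the last item can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the function name `elongate_span`, the `col_bounds` list of column x boundaries and the boolean `region_mask` are assumptions made for the example, and the greedy per-side growth is one simple way to find the largest overlap-free span.

```python
import numpy as np

def elongate_span(region_mask, col_bounds, y0, y1, c0, c1):
    """Grow the column span [c0, c1) of the horizontal band y0:y1
    to the left and right, one column at a time, for as long as the
    newly covered column contains no region pixels in that band.

    All names here are hypothetical; region_mask is a 2-D boolean
    array and col_bounds holds the x coordinates of column borders.
    """
    ncols = len(col_bounds) - 1

    def column_is_free(c):
        # True if no region pixel lies in band y0:y1 within column c
        return not region_mask[y0:y1, col_bounds[c]:col_bounds[c + 1]].any()

    while c0 > 0 and column_is_free(c0 - 1):
        c0 -= 1
    while c1 < ncols and column_is_free(c1):
        c1 += 1
    return c0, c1

# three columns of width 10; one region sits in the rightmost column
mask = np.zeros((10, 30), dtype=bool)
mask[2:8, 22:28] = True
print(elongate_span(mask, [0, 10, 20, 30], 4, 5, 1, 2))
```

In this toy page the separator in the middle column can grow to the left but not to the right, because the region in the third column intersects its y band.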
when the y slice (`top:bot`) is not a significant part of the page,
viz. less than 22% (as in `find_number_of_columns_in_document`),
avoid forcing `find_num_col` to reach `num_col_classifier`
(this keeps large headers from being split up, so they get ordered better)
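
As a rough sketch of that rule (the helper name `forced_num_col` and its signature are assumptions for illustration only; the real call into `find_num_col` is not shown here):

```python
def forced_num_col(top, bot, page_height, num_col_classifier,
                   min_fraction=0.22):
    """Return the column count to enforce for the slice top:bot,
    or None if the slice is too small a part of the page.

    Hypothetical helper: only the 22% threshold comes from the
    commit message, everything else is illustrative.
    """
    if (bot - top) < min_fraction * page_height:
        # small slice (e.g. a large header): do not force the
        # column detection up to the classifier's result
        return None
    return num_col_classifier

print(forced_num_col(0, 100, 1000, 3))   # slice is 10% of the page
print(forced_num_col(0, 500, 1000, 3))   # slice is 50% of the page
```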
simplify and document
- simplify
- rename identifiers for readability:
  - `y_sep` → `y_mid` (because the cy gets passed)
  - `y_diff` → `y_max` (because the ymax gets passed)
- array instead of list operations
- add docstring and in-line comments
- return (zero-length) numpy array instead of empty list
- when handling lines without mother,
where the biggest line already accounts for all columns,
but some lines are too close to the top and therefore must be removed,
avoid invalidating the `biggest` index (which caused an `IndexError`)
- remove try-except (now unnecessary)
- array instead of list operations
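
One way such an index can be kept valid while rows are removed with array operations is sketched below. This only illustrates the general pattern; `filter_keep_index` and its arguments are hypothetical names, and the actual fix may differ.

```python
import numpy as np

def filter_keep_index(values, keep, idx):
    """Apply the boolean mask `keep` to `values` and return the
    filtered array plus the new position of element `idx`
    (None if that element itself was removed).

    Hypothetical helper for illustration only.
    """
    values = np.asarray(values)
    keep = np.asarray(keep, dtype=bool)
    if keep[idx]:
        # the new position is the number of kept elements before idx
        new_idx = int(np.count_nonzero(keep[:idx]))
    else:
        new_idx = None
    return values[keep], new_idx

lines = np.array([10, 20, 30, 40])
too_close_to_top = np.array([True, False, True, False])
kept, biggest = filter_keep_index(lines, ~too_close_to_top, 3)
print(kept, biggest)
```

Re-deriving the index from the mask avoids reusing a position that referred to the unfiltered array.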
regarding the `splitter_y` result: for headings, instead of cutting right
through them via their center line, add their toplines and baselines as if
they were horizontal separators
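
A minimal sketch of the idea, assuming the split positions are collected into one sorted array (the function name, its arguments, and the inclusion of the page top and bottom are assumptions for the example):

```python
import numpy as np

def splitter_y_candidates(sep_cy, heading_ymin, heading_ymax, height):
    """Collect y positions at which the page may be split.

    Horizontal separators contribute their center y, headings
    contribute both their topline (ymin) and baseline (ymax)
    rather than their center. Hypothetical helper; adding 0 and
    `height` as outer bounds is an assumption.
    """
    ys = np.concatenate([[0], sep_cy, heading_ymin, heading_ymax, [height]])
    # np.unique sorts the values and removes duplicates
    return np.unique(ys.astype(int))

print(splitter_y_candidates([500], [100], [180], 1000))
```

With a heading spanning y=100 to y=180, the split lines fall above and below the heading, so the heading itself stays in one piece.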
extend horizontal separators to full image width if they do not overlap
any other regions
(only with regard to the returned `splitter_y` result,
but without changing the returned separators mask)