- drop connected components analysis to test overlaps between
horizontal separators and (horizontal) neighbours (introduced
in ab17a927)
- instead of converting headings to topline and baseline during
`find_number_of_columns_in_document` (introduced in 9f1595d7),
add them to the matrix unchanged, but mark as extra type
(besides horizontal and vertical separtors)
- convert headings to toplines and baselines no earlier than in
`return_boxes_of_images_by_order_of_reading_new`
- for both headings and horizontal separators, if they already
span multiple columns, check if they would overlap (horizontal)
neighbours by looking at successively larger (left and right)
intervals of columns (and pick the largest elongation which
does not introduce any overlaps)
when y slice (`top:bot`) is not a significant part of the page,
viz. less than 22% (as in `find_number_of_columns_in_document`),
avoid forcing `find_num_col` to reach `num_col_classifier`
(allows large headers not to be split up and thus better ordered)
simplify and document
- simplify
- rename identifiers to make readable:
- `y_sep` → `y_mid` (because the cy gets passed)
- `y_diff` → `y_max` (because the ymax gets passed)
- array instead of list operations
- add docstring and in-line comments
- return (zero-length) numpy array instead of empty list
- when handling lines without mother,
and biggest line already accounts for all columns,
but some are too close to the top and therefore must be removed,
avoid invalidating `biggest` index, causing `IndexError`
- remove try-catch (now unnecessary)
- array instead of list operations
regarding `splitter_y` result, for headings, instead of cutting right
through them via center line, add their toplines and baselines as if
they were horizontal separators
extend horizontal separators to full img width if they do not overlap
any other regions
(only as regards to returned `splitter_y` result,
but without changing returned separators mask)