after selecting the optimum angle on the original
search range, search again in its vicinity
with half the range (adding computational cost,
but gaining precision)
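
A minimal sketch of such a two-pass search, assuming a generic `score`
callable that rates a candidate angle (higher is better). The names
`linspace`, `best_angle` and `refine_angle`, the default range and the
step count are all hypothetical and not taken from the actual code:

```python
def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi (inclusive)."""
    if n == 1:
        return [(lo + hi) / 2]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def best_angle(score, lo, hi, steps):
    """Return the grid angle in [lo, hi] with the highest score."""
    return max(linspace(lo, hi, steps), key=score)


def refine_angle(score, lo=-2.0, hi=2.0, steps=9):
    """Coarse pass on [lo, hi], then a second pass on half that range,
    centred on the coarse optimum (hypothetical defaults)."""
    coarse = best_angle(score, lo, hi, steps)
    quarter = (hi - lo) / 4  # half the original range, split to both sides
    return best_angle(score, coarse - quarter, coarse + quarter, steps)


# toy score with its optimum at 0.37 degrees
toy_score = lambda angle: -(angle - 0.37) ** 2
coarse_only = best_angle(toy_score, -2.0, 2.0, 9)
refined = refine_angle(toy_score)
```

The second pass doubles the number of `score` evaluations, but halves
the grid spacing around the coarse optimum.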
when passing the text region mask, apply erosion iff
`not erosion_hurts`, instead of only when there are more than 2 columns
(consistent with `find_num_col`'s expectations, and making
it as easy to find the column gaps on 1- and 2-column pages
as on multi-column pages)
- `find_number_of_columns_in_document`: retain vertical separators
and pass to `find_num_col` for each vertical split
- `return_boxes_of_images_by_order_of_reading_new`: reconstruct
the vertical separators from the segmentation mask and the separator
bboxes; pass it on to `find_num_col` everywhere
- `return_boxes_of_images_by_order_of_reading_new`: no need to
try-catch `find_num_col` anymore
- `return_boxes_of_images_by_order_of_reading_new`: when a vertical
split has too few columns,
* do not raise but lower the threshold `multiplier` responsible for
allowing gaps as column boundaries
* do not pass the `num_col_classifier` (i.e. expected number of
resulting columns) of the entire page to the iterative
`find_num_col` for each existing column, but only the portion
of that span
when searching for gaps between text regions, consider the vertical
separator mask (if given): add the vertical sum of vertical separators
to the peak scores (making column detection more robust if the page is
still slightly skewed or the gaps are partially obscured by multi-column
regions, but foreground separators are present)
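
A simplified sketch of that idea, assuming binary masks given as lists
of rows and ignoring the smoothing and peak finding done in the real
code. The helper names `column_sums` and `gap_scores` are hypothetical:

```python
def column_sums(mask):
    """Sum a binary mask (list of rows) along the vertical axis."""
    return [sum(col) for col in zip(*mask)]


def gap_scores(text_mask, vsep_mask=None):
    """Score each x position as a candidate column gap.

    The base score is the number of rows without text at that x.
    If a vertical separator mask is given, its column sums are added,
    so x positions carrying a separator line score higher.
    """
    height = len(text_mask)
    scores = [height - s for s in column_sums(text_mask)]
    if vsep_mask is not None:
        scores = [s + v for s, v in zip(scores, column_sums(vsep_mask))]
    return scores


# two text columns with a gap at x=3; the first row is a heading
# spanning both columns and partially obscuring the gap
text = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1],
]
# a vertical separator line in the gap, below the heading
vsep = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
]
plain = gap_scores(text)
boosted = gap_scores(text, vsep)
```

In this toy page, the separator raises the score at x=3 while leaving
all other positions unchanged.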
- when analysing regions spanning across columns,
disregard tiny regions (smaller than half the median size)
- if a region spans across columns just by a tiny fraction,
and therefore is not good enough for a multi-col separator,
then it should also not be good enough for a multi-col box
maker
- avoid unnecessary `fillPoly` (we already have the mask)
- do not merge hseps if vseps interfere
- remove old criterion (based on total length of hseps)
- create new criterion (no x overlap and x close to each other)
- rename identifiers:
* `sum_dis` → `sum_xspan`
* `diff_max_min_uniques` → `tot_xspan`
* `np.std / np.mean` → `dev_xspan`
- remove rule cutting around the center of crossing seps
(which is unnecessary and creates small isolated seps
at the center, unrelated to the actual crossing points)
- create rule cutting hseps by vseps _prior_ to merging
- `do_order_of_regions`: simplify aggregating per-box orders
for paragraphs and headings to overall order passed to
`xml_reading_order`; no need for `order_and_id_of_texts`,
no need to return `id_of_texts_tot`
- `do_order_of_regions_with_model`: no need to return `region_ids`
- writer: no need to pass `id_of_texts_tot` in `build_pagexml`
(because the latter does not preserve coordinates;
it scales, even when resizing the image;
this caused coordinate problems when matching deskewed contours)
- reduce `sigma` for smoothing of input to `find_peaks`
(so we get deeper gaps between columns)
- allow column boundaries closer to the margins
(50 instead of 100 or 200 px, 170 instead of 370 px)
- allow column boundaries closer to each other
(300 instead of 400 px)
- add a secondary `grenze` criterion for depth of gap
(relative to lowest minimum, if that is smaller than
the old criterion relative to lowest maximum)
- for calls to `find_num_col` within parts of a page,
do allow unbalanced column boundaries
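
The margin and spacing limits above can be illustrated with a small
filter over candidate gap positions. This is only a sketch under
assumptions: the function `filter_boundaries`, its `(x, depth)` input
format and the rule of keeping the deeper of two close candidates are
hypothetical, and only the pixel values are taken from the list above:

```python
def filter_boundaries(candidates, width, margin=50, min_dist=300):
    """Filter candidate column boundaries.

    candidates: list of (x, depth) pairs, larger depth = clearer gap.
    Candidates closer than `margin` px to either page edge are dropped.
    Of two candidates closer than `min_dist` px to each other, only the
    deeper one is kept (an assumed tie-breaking rule).
    """
    inner = sorted((x, d) for x, d in candidates
                   if margin <= x <= width - margin)
    kept = []
    for x, depth in inner:
        if kept and x - kept[-1][0] < min_dist:
            if depth > kept[-1][1]:
                kept[-1] = (x, depth)  # replace the shallower neighbour
            continue
        kept.append((x, depth))
    return [x for x, _ in kept]


candidates = [(30, 5), (60, 2), (400, 4), (650, 6), (1000, 3), (1180, 9)]
relaxed = filter_boundaries(candidates, width=1200)
strict = filter_boundaries(candidates, width=1200, margin=100, min_dist=400)
```

With the relaxed limits, more candidates near the margins and near
each other survive than with the stricter ones.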