Commit graph

1488 commits

Author SHA1 Message Date
Robert Sachunsky
d2654af6b8 Merge cbb3be0e01 into c9f6aa35b2 2026-04-30 14:20:54 +00:00
Robert Sachunsky
cbb3be0e01 add diagnostic plotting for prediction masking (commented) 2026-04-30 16:12:00 +02:00
Robert Sachunsky
33c055389d bold run_single refactoring (predict segmentation on cropped img)…
- move `extract_page()` to the start (right after enhancement),
  so early layout and textline model prediction sees cropped
  image
- `extract_page()`: also return page mask
- `get_early_layout()`:
  * use cropped image
  * also run optional table prediction here,
    map table label and confidence already
    (so no need to pass these arrays everywhere)
  * suppress all non-text type regions in textline mask
  * also return text+table mask
    (so no need to reconstruct it everywhere)
- apply page mask to textline mask and early layout result
  (i.e. suppress areas beyond border contour)
- `run_graphics_and_columns()`:
  * rename → `run_columns()`
  * no table prediction here
  * no page extraction here
  * no page cropping+masking here
  * no textline mask suppression here
- `run_graphics_and_columns_without_layout()`: drop
  (not needed anymore)
- `run_marginals()` vs. `get_marginals()`: extract
  `text_mask` internally from early layout
- early page cropping for col-classifier:
  also use cropped image in input binarization mode
- early page cropping for col-classifier:
  get external contours instead of indiscriminate tree
- writer: skip layout mode now also uses cropped coordinates
  (so drop kwarg for it)
2026-04-30 16:12:00 +02:00
Robert Sachunsky
7e7cc6a801 do_order_of_regions(): use region mask instead of textline mask…
for local (within-box) ordering of region contours, use the same
text mask (merely eroded) as for the contour extraction itself:
the text+table+drop mask from early+full layout prediction,
rather than the textline mask, because the latter may be empty
in some boxes and is unlikely to be more useful than the region
mask itself
2026-04-30 16:11:59 +02:00
Robert Sachunsky
63df9be4db find_number_of_columns_in_document(): pass in (reuse) masks 2026-04-30 16:11:59 +02:00
Robert Sachunsky
da9e00cfe5 consistently handle textline mask with respect to drop-capital mask…
- suppress drop-capital in textline mask for textline contours
- elevate drop-capital in textline mask for reading order boxes
2026-04-30 16:11:59 +02:00
Robert Sachunsky
2641171fb1 return_boxes_...order_of_reading...: avoid negative slices…
fix rare bug when horizontal separators are detected
at the very top (of a major vertical part of the page),
causing box intervals to become negative
2026-04-30 16:11:59 +02:00
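The failure mode named in 2641171fb1 can be illustrated with a minimal sketch. It assumes the box intervals end up as Python/NumPy slice bounds, where a negative start silently wraps around instead of raising. The helper name `safe_interval` and its signature are hypothetical and not taken from the repository.

```python
def safe_interval(start, stop, length):
    """Clamp a (start, stop) pair so it can be used as a slice without wrap-around.

    Hypothetical helper: a negative start is clamped to 0, and stop is never
    allowed to fall below start, so the slice is empty rather than inverted.
    """
    start = max(0, min(start, length))
    stop = max(start, min(stop, length))
    return start, stop


rows = list(range(100))
start, stop = safe_interval(-5, 10, len(rows))
top_rows = rows[start:stop]  # the first ten rows, not rows[-5:10]
```

Without the clamp, `rows[-5:10]` would be an empty list here, because the negative start is counted from the end of the sequence.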
Robert Sachunsky
6a92f0d49c make get_deskewed_masks() unconditional, call only when needed 2026-04-30 16:11:59 +02:00
Robert Sachunsky
52eb4c9a0a move label definition and deskewing cancellation up 2026-04-30 16:11:59 +02:00
Robert Sachunsky
fa882e1dbe move run_boxes_order() call to RO section of run_single() 2026-04-30 16:11:59 +02:00
Robert Sachunsky
d88bd485ff get_slopes*(): does not need passing boxes separately 2026-04-30 16:11:59 +02:00
Robert Sachunsky
869646cbf5 get_full_layout() does not need the textline mask 2026-04-30 16:11:59 +02:00
Robert Sachunsky
b5bc161a4c extract_page(): get external contours instead of indiscriminate tree 2026-04-30 16:11:59 +02:00
Robert Sachunsky
287bebde0d get_marginals(): fix height factor for mask resizing 2026-04-30 16:11:59 +02:00
Robert Sachunsky
a031d590b8 get_marginals(): do allow both left and right point (f/u 4bdea39)…
(as there are valid cases where both left and right marginalia
 are present) follow up on 4bdea39 by re-allowing left point _and_
right point - but still score-based, and not if very asymmetric
2026-04-30 16:11:59 +02:00
Robert Sachunsky
9571ce3474 get_marginals(): reduce indentation 2026-04-30 16:11:52 +02:00
Robert Sachunsky
c18deb0722 drop relabelling all marginalia to main if no main (now unnecessary) 2026-04-30 16:09:03 +02:00
Robert Sachunsky
1f6db34adf run/get_marginals(): simplify and speed up…
- `get_marginals` modifies region labels in-place anyways,
  so no need for retval
- de/rotate only inside `get_marginals` (for consistency)
- return early if no marginals detected
- `run_marginals`: only useful in 1 or 2 columns, so keep to
  that conditional branch; allows avoiding unnecessary resizing
  of images to and fro
- rename `text_regions_p_1` → `text_regions_p`
2026-04-30 16:09:03 +02:00
Robert Sachunsky
45a43f7e5e get_marginals(): fixup point_right fallback 2026-04-30 16:08:15 +02:00
Robert Sachunsky
68ceeec764 get_marginals(): improve contour assignment…
- use undeskewed mask for contour comparisons
  instead of deskewed mask (less precise)
- rename `text_with_lines` → `text_mask_d`
- rename `mask_marginals` → `main_mask_d`
- rename `text_regions` → `early_layout`
- rename `...textline...` → `...text...`
2026-04-25 03:06:34 +02:00
Robert Sachunsky
6d55d0b87b get_marginals(): improve peak point threshold criterion…
in search of valid peaks (gaps between text columns),
- drop absolute values for minimum gap depth
  (likely crafted for some fixed resolution examples)
- instead, use criterion relative to maximum column depth
  and page height (trying to loosely approximate the prior
  constants, albeit somewhat more permissive)
2026-04-25 02:23:16 +02:00
Robert Sachunsky
4bdea39c98 get_marginals(): improve left/right point selection…
in search of valid (above threshold) peaks:
- do not just pick right-most left and left-most right span;
- instead,
  * if no peaks on the left, then only search right
  * if no peaks on the right, then only search left
  * if peaks on both sides, then only better side
    (so never return marginals on both sides!)
  * use scoring for peaks that reflects their peak
    prominence and peak height (but keep positional
    range constraints for what constitutes left and right)
2026-04-25 01:59:48 +02:00
Robert Sachunsky
70bf461c30 get_marginals(): simplify, improve…
- rename `thickness_along_y_percent` →
  `max_textline_thickness_percent`
- rename `marginlas_should_be_main_text` →
  `main_text_should_be_marginals`
- constrain `find_peaks()` by prominence and distance
- simplify (a lot)
- add comments for possible improvements
  and for plotting
2026-04-25 01:52:21 +02:00
Robert Sachunsky
bb092364af get_slopes_and_deskew_new_light2: estimate slopes here, too…
extract slopes from minimal bounding rectangles of textlines,
using heuristics on aspect ratios, lengths and angles
2026-04-24 15:27:29 +02:00
Robert Sachunsky
c478c03db4 avoid deskewed contour matching w/ -romb 2026-04-24 15:27:29 +02:00
Robert Sachunsky
998ee2ecee get_textlines_of_a_textregion_sorted: simplify 2026-04-24 15:27:29 +02:00
Robert Sachunsky
be61875d6e get_textlines_of_a_textregion_sorted: w-h instead of w/h test 2026-04-24 15:27:29 +02:00
Robert Sachunsky
9723dfeb73 writer: also annotate col-classifier result…
both notations:
- in `/PcGts/Page/@custom` (CSS-style)
- in `/PcGts/Metadata/Comment` (qurator-style)
2026-04-24 15:27:29 +02:00
Robert Sachunsky
e3720d6623 writer: also annotate page-level deskewing result 2026-04-24 15:27:29 +02:00
Robert Sachunsky
2da718f76f writer, do_work_of_slopes*: drop passing bboxes around
(needed no more)
2026-04-24 15:27:29 +02:00
Robert Sachunsky
b792324c5b do_work_of_slopes_new_curved (if angle >45°): simplify, improve…
- use new `rotate_image_enlarge` instead of
  custom (insufficient) padding w/ `rotate_image`
- get external contours instead of tree
  (without checking hierarchy afterwards)
- use largest textline contours by area instead of
  longest polygon path
- always use `separate_lines` (but without its incorrect
  angle/offset calculations) instead of `separate_lines_vertical_cont`
- calculate coordinate transformation (shift, angle)
  for all cases (including >45°)
- simplify
2026-04-24 15:27:29 +02:00
Robert Sachunsky
dbdb6d0d53 rotate: rm unused failed variants, add new rotate_image_enlarge
(correct version that enlarges canvas instead of clipping corners,
 using only OpenCV)
2026-04-24 15:27:29 +02:00
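The idea behind dbdb6d0d53 (enlarging the canvas so rotated corners are not clipped) can be sketched with plain arithmetic. This is not the repository's `rotate_image_enlarge`; the function name `enlarged_rotation` is hypothetical. The matrix follows the layout documented for `cv2.getRotationMatrix2D` at scale 1, with the translation shifted to the centre of the larger canvas, so it could be handed to `cv2.warpAffine` together with the returned size.

```python
import math


def enlarged_rotation(width, height, angle_deg):
    """Return (2x3 affine matrix, (new_width, new_height)) for a rotation
    about the image centre that keeps all four corners inside the canvas."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # bounding box of the rotated rectangle (small epsilon absorbs float noise)
    new_w = int(math.ceil(abs(width * cos_t) + abs(height * sin_t) - 1e-9))
    new_h = int(math.ceil(abs(width * sin_t) + abs(height * cos_t) - 1e-9))
    cx, cy = width / 2.0, height / 2.0
    # rotation about (cx, cy), same layout as cv2.getRotationMatrix2D
    matrix = [
        [cos_t, sin_t, (1 - cos_t) * cx - sin_t * cy],
        [-sin_t, cos_t, sin_t * cx + (1 - cos_t) * cy],
    ]
    # move the old centre onto the centre of the enlarged canvas
    matrix[0][2] += new_w / 2.0 - cx
    matrix[1][2] += new_h / 2.0 - cy
    return matrix, (new_w, new_h)


matrix, size = enlarged_rotation(100, 50, 90)
```

With OpenCV available, the result would be applied roughly as `cv2.warpAffine(img, numpy.array(matrix), size)`.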
Robert Sachunsky
d257869d83 do_work_of_slopes_new_curved (if angle <45°): simplify, improve…
- use relative images, cropped to parent bbox (faster)
- no `scale` parameter (unused)
- use largest textline contours by area instead of first
- simplify
2026-04-24 15:27:29 +02:00
Robert Sachunsky
0dce1f24d2 do_work_of_slopes_new_curved: improve deskewing…
- return early if textline mask is empty
- intersect textline mask with parent mask
  (so neighbouring, truncated textlines
   will not interfere)
- fix bug when resulting angle is small:
  rather, compare with page angle
- if there is more than 1 line in the region,
  * use median instead of mean to estimate y_diff
  * if height dominates over width and x_diff
    over y_diff, then assume 90°: transpose image,
    deskew on that, then add 90° to result
- otherwise instead of just using page angle,
  try to estimate single-line angle by approximating
  slope of linear x-y regression on mask image;
  again, if height dominates over width, then
  assume +90° and use transposed image
- drop unused `scale` param
2026-04-24 15:27:29 +02:00
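The single-line case in 0dce1f24d2 (estimating an angle from the slope of a linear x-y regression on the mask, and switching to the transposed image when height dominates) might look roughly like the sketch below. It is an illustration under assumptions, not the repository code: the function name `estimate_line_angle` is hypothetical, the input is a list of foreground pixel coordinates rather than a mask array, and the returned angle is measured from the x-axis without regard to the image-coordinate sign convention used by the real deskewing.

```python
import math


def estimate_line_angle(points):
    """Estimate the angle (degrees from the x-axis) of a single text line
    from the (x, y) coordinates of its foreground pixels."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    # if height dominates, regress x on y instead (i.e. use the transposed image)
    transposed = height > width
    if transposed:
        xs, ys = ys, xs
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0  # degenerate input: a single pixel position
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    angle = math.degrees(math.atan(sxy / sxx))
    # a fit on the transposed image is measured from the vertical
    return 90.0 - angle if transposed else angle


flat = estimate_line_angle([(i, 0.1 * i) for i in range(50)])
upright = estimate_line_angle([(5, i) for i in range(50)])
```

Regressing on the dominant axis keeps the slope small and the fit stable, which is presumably why the commit transposes near-vertical lines before estimating.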
Robert Sachunsky
97d9b0ea50 small_textlines_to_parent_adherence2: simplify, improve…
- when merging large line with small lines,
  don't use first new contour but largest
- get external contours instead of tree
  (without checking hierarchy afterwards)
- simplify
2026-04-24 15:27:29 +02:00
Robert Sachunsky
0735cb9d2b filter_contours_without_textline_inside: also filter slopes 2026-04-24 15:27:29 +02:00
Robert Sachunsky
fa8340dbb4 -cl: also filter textregions without textlines here 2026-04-24 15:27:29 +02:00
Robert Sachunsky
4a6d3968f9 major run_single refactoring…
- rename `get_regions()` → `get_early_layout()`
- split up `run_boxes_no/full_layout()` into shared
  * `get_full_layout()` (for label mapping,
    table decoding and optional full model prediction)
  * `get_deskewed_masks()` (for de-rotation)
  * extraction of various region types (polygons and confidences)
  * `run_boxes_order()` (for column detection and box ordering)
- rename `contours_tables` → `polygons_of_tables`

This further reduces redundant code, avoids splitting up the same
functionality across different places depending on mode etc.
2026-04-24 15:27:29 +02:00
Robert Sachunsky
dfb40f4a49 hsep fusion: avoid zero division if zero overlap 2026-04-24 15:27:29 +02:00
Robert Sachunsky
b63e073121 skip deskewing if no textlines 2026-04-24 15:27:29 +02:00
Robert Sachunsky
7b5aa2a1f6 more run_single refactoring…
- `run_single`: re-use `return_contours_of_interested_region`
  for extraction and filtering of text region contours
- `run_single`: isolate new function `match_deskewed_contours`
- `run_single`: apply dilation afterwards
- rename `contours_only_text_parent_d_ordered` → `polygons_of_textregions_d`
- rename `contours_only_text_parent` → `polygons_of_textregions`
- rename `contours_only_text_parent_h` → `polygons_of_textregions_h`
- `do_work_of_slopes_new_curved` and `get_slopes_and_deskew_new_curved`:
   no need for `mask_texts_only` array arg
- `filter_contours_inside_a_bigger_one`: no need for `image` as array arg,
  simplify
- `split_textregion_main_vs_head`: simplify, re-order arguments
  and return tuple logically
- if no main text regions are found, just convert marginals to main text
  and continue normally instead of stopping early w/ empty marginals (i.e.
  no textlines)
2026-04-24 15:27:29 +02:00
Robert Sachunsky
a2f43b8d69 simplify, add confidence for headings as well 2026-04-23 21:14:39 +02:00
Robert Sachunsky
264b00f8ab predictor: cache models' input shape instead of output shape 2026-04-23 21:14:39 +02:00
Robert Sachunsky
829256df91 do_prediction*: remove autosized variants, simplify 2026-04-23 21:14:39 +02:00
Robert Sachunsky
de65a55a04 mbro: simplify, add drop-caps as well, reduce batch size…
- do_order_of_regions_with_model:
  * add `polygons_of_drop_capitals`, order these indices as well
    (model was not trained for this, but it works)
  * explicit label identifiers instead of number literals
  * map marginals and images correctly
  * simplify (a lot)
  * reduce inference batch size to accommodate 8 GB VRAM GPUs
- return_indexes_of_contours_located_inside_another_list_of_contours:
  simplify
2026-04-23 21:14:39 +02:00
Robert Sachunsky
0dfc9d911f run_boxes_no_full_layout: also map to fl labels here…
(because -mbro assumes the label set from -fl)
2026-04-20 18:20:58 +02:00
Robert Sachunsky
0015f2675b with -slro, also extract and apply page (Border) mask 2026-04-20 18:20:58 +02:00
Robert Sachunsky
569b96d1a9 find_number_of_columns_in_document: pass correct label_seps…
- in fl: 6
- non-fl: 3 (now fixed)
2026-04-20 18:20:58 +02:00
Robert Sachunsky
f28a9c9e0b add confidence for all region types, prepare for textlines…
- pass on probabilities from predicted class everywhere
- rename `confidence_matrix` → `confidence_regions` / `regions_confidence`
- rename `get_textregion_confidences()` → `get_region_confidences()`
- add same for tables, textlines and regionsfl (full layout model)
- aggregate per-region confidence lists for image, table, drop-capital,
  left marginal and right marginal regions
- add in writer
- simplify/re-indent some
- try to replace more number literals with class label identifiers
2026-04-20 18:20:58 +02:00
Robert Sachunsky
1164b97917 extract_text_regions_new: fix heading thresholding…
- re-introduce boosting `heading` thresholding broken
  when refactoring (light version and do_prediction)
- also return confidence for full layout prediction
2026-04-20 18:20:58 +02:00