Commit graph

1494 commits

Author SHA1 Message Date
Robert Sachunsky
ea8f985ff1 apply cropping only after textline and early layout…
(because old models seem to fare better that way,
 despite training documentation)
2026-05-08 18:41:47 +02:00
Robert Sachunsky
58afdf5e87 do_prediction*(): ensure always returns dtype=uint8 2026-05-08 17:36:31 +02:00
Robert Sachunsky
68a26a5c3f do_prediction*(): smooth window transitions with sigmoid…
instead of hard cut-offs between overlapping window tiles,
apply sigmoid attenuation to slide from one to the next

(apply all postprocessing in the end)
2026-05-08 05:18:00 +02:00
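
A minimal sketch of the sigmoid transition described in 68a26a5c3f, reduced to one dimension. It is not the code from `do_prediction*()`: the names `sigmoid_weights`, `blend_tiles` and the `steepness` parameter are hypothetical, and the real implementation works on 2-D prediction tiles.

```python
import math

def sigmoid_weights(overlap, steepness=6.0):
    """Weights of the right-hand tile across `overlap` shared pixels.

    `steepness` (assumed value) sets how far into the sigmoid tails
    the overlap reaches; weights rise smoothly from ~0 to ~1.
    """
    if overlap == 1:
        return [0.5]
    weights = []
    for i in range(overlap):
        # map pixel index to the interval [-steepness, +steepness]
        x = (2.0 * i / (overlap - 1) - 1.0) * steepness
        weights.append(1.0 / (1.0 + math.exp(-x)))
    return weights

def blend_tiles(left, right, overlap):
    """Merge two 1-D rows of probabilities sharing `overlap` pixels."""
    if overlap <= 0:
        return list(left) + list(right)
    w = sigmoid_weights(overlap)
    merged = list(left[:-overlap])
    for i in range(overlap):
        l = left[len(left) - overlap + i]
        r = right[i]
        # slide from the left tile to the right tile
        merged.append((1.0 - w[i]) * l + w[i] * r)
    merged.extend(right[overlap:])
    return merged
```

With a hard cut-off, the merged row would jump from one tile's value to the other's at a single pixel; here the two predictions are mixed gradually over the whole overlap.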
Robert Sachunsky
cefe596f8b do_prediction*(): avoid unnecessary tiles, simplify…
- calculation for number of tiles: sometimes one less
  tile is needed by making the previous last tile
  half-full on the right side
- calculation of window margins: fix case if dimension
  extends to full image shape
- simplify (identifiers, slicing etc)
2026-05-08 00:55:18 +02:00
Robert Sachunsky
d8c83d6137 make_valid(): avoid oversimplification, improve parameter search 2026-05-05 15:00:16 +02:00
Robert Sachunsky
45868e99cd get_slopes_and_deskew_new_light2: ignore tiny contour areas 2026-05-04 15:55:00 +02:00
Robert Sachunsky
934ac90e92 get_slopes_and_deskew_new_light2: avoid +/- 90° cancellation…
in `estimate_skew_contours()`, distinguish between angle stats
scattering around <45° vs >45°: in the latter case, use modulo
180° for averages - to avoid cancelling out +90° with -90°
2026-05-04 15:52:07 +02:00
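
The cancellation problem in 934ac90e92 can be illustrated with a small sketch. This is an assumption-laden illustration, not the code of `estimate_skew_contours()`: the function name `average_skew` and the use of the median of absolute angles as the <45° vs >45° test are hypothetical.

```python
from statistics import mean, median

def average_skew(angles):
    """Average skew angles given in degrees, each within [-90, 90].

    If the angles scatter around the horizontal (<45 deg), a plain mean
    is fine. If they scatter around the vertical (>45 deg), +90 and -90
    describe the same orientation, so average modulo 180 instead.
    """
    if not angles:
        return 0.0
    if median(abs(a) for a in angles) <= 45:
        return mean(angles)
    # wrap into [0, 180) so that e.g. -88 becomes 92, next to +88
    wrapped = [a % 180 for a in angles]
    avg = mean(wrapped)
    # map the result back into (-90, 90]
    return avg - 180 if avg > 90 else avg
```

For the angles `[88, -88, 90, -90]`, a plain mean gives 0, although all four contours are nearly vertical; averaging modulo 180° keeps the estimate at 90.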
Robert Sachunsky
29bb55ceff return_deskew_slop: no >90° search unless for full page, simplify 2026-05-01 00:27:00 +02:00
Robert Sachunsky
cbb3be0e01 add diagnostic plotting for prediction masking (commented) 2026-04-30 16:12:00 +02:00
Robert Sachunsky
33c055389d bold run_single refactoring (predict segmentation on cropped img)…
- move `extract_page()` to the start (right after enhancement),
  so early layout and textline model prediction sees cropped
  image
- `extract_page()`: also return page mask
- `get_early_layout()`:
  * use cropped image
  * also run optional table prediction here,
    map table label and confidence already
    (so no need to pass these arrays everywhere)
  * suppress all non-text type regions in textline mask
  * also return text+table mask
    (so no need to reconstruct it everywhere)
- apply page mask to textline mask and early layout result
  (i.e. suppress areas beyond border contour)
- `run_graphics_and_columns()`:
  * rename → `run_columns()`
  * no table prediction here
  * no page extraction here
  * no page cropping+masking here
  * no textline mask suppression here
- `run_graphics_and_columns_without_layout()`: drop
  (not needed anymore)
- `run_marginals()` vs. `get_marginals()`: extract
  `text_mask` internally from early layout
- early page cropping for col-classifier:
  also use cropped image in input binarization mode
- early page cropping for col-classifier:
  get external contours instead of indiscriminate tree
- writer: skip layout mode now also uses cropped coordinates
  (so drop kwarg for it)
2026-04-30 16:12:00 +02:00
Robert Sachunsky
7e7cc6a801 do_order_of_regions(): use region mask instead of textline mask…
for local (within-box) ordering of region contours, use the same
text mask (merely eroded) as for the contour extraction itself:
the text+table+drop mask from early+full layout prediction,
rather than the textline mask, because the latter may be empty
in some boxes and is unlikely to be more useful than the region
mask itself
2026-04-30 16:11:59 +02:00
Robert Sachunsky
63df9be4db find_number_of_columns_in_document(): pass in (reuse) masks 2026-04-30 16:11:59 +02:00
Robert Sachunsky
da9e00cfe5 consistently handle textline mask with respect to drop-capital mask…
- suppress drop-capital in textline mask for textline contours
- elevate drop-capital in textline mask for reading order boxes
2026-04-30 16:11:59 +02:00
Robert Sachunsky
2641171fb1 return_boxes_...order_of_reading...: avoid negative slices…
fix rare bug when horizontal separators are detected
at the very top (of a major vertical part of the page),
causing box intervals to become negative
2026-04-30 16:11:59 +02:00
Robert Sachunsky
6a92f0d49c make get_deskewed_masks() unconditional, call only when needed 2026-04-30 16:11:59 +02:00
Robert Sachunsky
52eb4c9a0a move label definition and deskewing cancellation up 2026-04-30 16:11:59 +02:00
Robert Sachunsky
fa882e1dbe move run_boxes_order() call to RO section of run_single() 2026-04-30 16:11:59 +02:00
Robert Sachunsky
d88bd485ff get_slopes*(): does not need passing boxes separately 2026-04-30 16:11:59 +02:00
Robert Sachunsky
869646cbf5 get_full_layout() does not need the textline mask 2026-04-30 16:11:59 +02:00
Robert Sachunsky
b5bc161a4c extract_page(): get external contours instead of indiscriminate tree 2026-04-30 16:11:59 +02:00
Robert Sachunsky
287bebde0d get_marginals(): fix height factor for mask resizing 2026-04-30 16:11:59 +02:00
Robert Sachunsky
a031d590b8 get_marginals(): do allow both left and right point (f/u 4bdea39)…
(as there are valid cases where both left and right marginalia
 are present) follow-up 4bdea39 by re-allowing left point _and_
right point - but still score-based, and not if very asymmetric
2026-04-30 16:11:59 +02:00
Robert Sachunsky
9571ce3474 get_marginals(): reduce indentation 2026-04-30 16:11:52 +02:00
Robert Sachunsky
c18deb0722 drop relabelling all marginalia to main if no main (now unnecessary) 2026-04-30 16:09:03 +02:00
Robert Sachunsky
1f6db34adf run/get_marginals(): simplify and speed up…
- `get_marginals` modifies region labels in-place anyway,
  so no need for retval
- de/rotate only inside `get_marginals` (for consistency)
- return early if no marginals detected
- `run_marginals`: only useful in 1 or 2 columns, so keep to
  that conditional branch; allows avoiding unnecessary resizing
  of images to and fro
- rename `text_regions_p_1` → `text_regions_p`
2026-04-30 16:09:03 +02:00
Robert Sachunsky
45a43f7e5e get_marginals(): fixup point_right fallback 2026-04-30 16:08:15 +02:00
Robert Sachunsky
68ceeec764 get_marginals(): improve contour assignment…
- use undeskewed mask for contour comparisons
  instead of deskewed mask (less precise)
- rename `text_with_lines` → `text_mask_d`
- rename `mask_marginals` → `main_mask_d`
- rename `text_regions` → `early_layout`
- rename `...textline...` → `...text...`
2026-04-25 03:06:34 +02:00
Robert Sachunsky
6d55d0b87b get_marginals(): improve peak point threshold criterion…
in search of valid peaks (gaps between text columns),
- drop absolute values for minimum gap depth
  (likely crafted for some fixed resolution examples)
- instead, use criterion relative to maximum column depth
  and page height (trying to loosely approximate the prior
  constants, albeit somewhat more permissive)
2026-04-25 02:23:16 +02:00
Robert Sachunsky
4bdea39c98 get_marginals(): improve left/right point selection…
in search of valid (above threshold) peaks:
- do not just pick right-most left and left-most right span;
- instead,
  * if no peaks on the left, then only search right
  * if no peaks on the right, then only search left
  * if peaks on both sides, then only better side
    (so never return marginals on both sides!)
  * use scoring for peaks that reflects their peak
    prominence and peak height (but keep positional
    range constraints for what constitutes left and right)
2026-04-25 01:59:48 +02:00
Robert Sachunsky
70bf461c30 get_marginals(): simplify, improve…
- rename `thickness_along_y_percent` →
  `max_textline_thickness_percent`
- rename `marginlas_should_be_main_text` →
  `main_text_should_be_marginals`
- constrain `find_peaks()` by prominence and distance
- simplify (a lot)
- add comments for possible improvements
  and for plotting
2026-04-25 01:52:21 +02:00
Robert Sachunsky
bb092364af get_slopes_and_deskew_new_light2: estimate slopes here, too…
extract slopes from minimal bounding rectangles of textlines,
using heuristics on aspect ratios, lengths and angles
2026-04-24 15:27:29 +02:00
Robert Sachunsky
c478c03db4 avoid deskewed contour matching w/ -romb 2026-04-24 15:27:29 +02:00
Robert Sachunsky
998ee2ecee get_textlines_of_a_textregion_sorted: simplify 2026-04-24 15:27:29 +02:00
Robert Sachunsky
be61875d6e get_textlines_of_a_textregion_sorted: w-h instead of w/h test 2026-04-24 15:27:29 +02:00
Robert Sachunsky
9723dfeb73 writer: also annotate col-classifier result…
both notations:
- in `/PcGts/Page/@custom` (CSS-style)
- in `/PcGts/Metadata/Comment` (qurator-style)
2026-04-24 15:27:29 +02:00
Robert Sachunsky
e3720d6623 writer: also annotate page-level deskewing result 2026-04-24 15:27:29 +02:00
Robert Sachunsky
2da718f76f writer, do_work_of_slopes*: drop passing bboxes around
(needed no more)
2026-04-24 15:27:29 +02:00
Robert Sachunsky
b792324c5b do_work_of_slopes_new_curved (if angle >45°): simplify, improve…
- use new `rotate_image_enlarge` instead of
  custom (insufficient) padding w/ `rotate_image`
- get external contours instead of tree
  (without checking hierarchy afterwards)
- use largest textline contours by area instead of
  longest polygon path
- always use `separate_lines` (but without its incorrect
  angle/offset calculations) instead of `separate_lines_vertical_cont`
- calculate coordinate transformation (shift, angle)
  for all cases (including >45°)
- simplify
2026-04-24 15:27:29 +02:00
Robert Sachunsky
dbdb6d0d53 rotate: rm unused failed variants, add new rotate_image_enlarge
(correct version that enlarges canvas instead of clipping corners,
 using only OpenCV)
2026-04-24 15:27:29 +02:00
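
A sketch of the canvas-enlarging idea behind dbdb6d0d53, assuming the usual approach of rotating about the image centre and shifting the result into a larger output. The function name `enlarged_rotation` is hypothetical and this is not the code of `rotate_image_enlarge`. Only the geometry is computed here, in pure Python; the matrix has the same layout as the one returned by `cv2.getRotationMatrix2D`, plus a translation, and could be passed (as a NumPy array) to `cv2.warpAffine` together with the new size.

```python
import math

def enlarged_rotation(width, height, angle_deg):
    """Return (new_width, new_height, 2x3 affine matrix) for a rotation
    by `angle_deg` about the image centre that clips no corners."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    # bounding box of the rotated rectangle (small epsilon against
    # floating-point noise such as cos(90 deg) != 0)
    new_w = math.ceil(width * abs(cos_a) + height * abs(sin_a) - 1e-9)
    new_h = math.ceil(width * abs(sin_a) + height * abs(cos_a) - 1e-9)
    cx, cy = width / 2.0, height / 2.0
    # rotation about (cx, cy), then shift the centre to the new centre
    tx = (1.0 - cos_a) * cx - sin_a * cy + (new_w / 2.0 - cx)
    ty = sin_a * cx + (1.0 - cos_a) * cy + (new_h / 2.0 - cy)
    matrix = [[cos_a, sin_a, tx],
              [-sin_a, cos_a, ty]]
    return new_w, new_h, matrix
```

Without the enlarged size and the extra translation, a rotation into the original canvas cuts off the corners of the image.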
Robert Sachunsky
d257869d83 do_work_of_slopes_new_curved (if angle <45°): simplify, improve…
- use relative images, cropped to parent bbox (faster)
- no `scale` parameter (unused)
- use largest textline contours by area instead of first
- simplify
2026-04-24 15:27:29 +02:00
Robert Sachunsky
0dce1f24d2 do_work_of_slopes_new_curved: improve deskewing…
- return early if textline mask is empty
- intersect textline mask with parent mask
  (so neighbouring, truncated textlines
   will not interfere)
- fix bug when resulting angle is small:
  rather, compare with page angle
- if there is more than 1 line in the region,
  * use median instead of mean to estimate y_diff
  * if height dominates over width and x_diff
    over y_diff, then assume 90°: transpose image,
    deskew on that, then add 90° to result
- otherwise instead of just using page angle,
  try to estimate single-line angle by approximating
  slope of linear x-y regression on mask image;
  again, if height dominates over width, then
  assume +90° and use transposed image
- drop unused `scale` param
2026-04-24 15:27:29 +02:00
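
The single-line fallback in 0dce1f24d2 (slope of a linear x-y regression on the mask, with a transposed image if height dominates) might look roughly like the sketch below. The function name `estimate_line_angle`, the list-of-rows mask format and the sign convention of the returned angle are assumptions for illustration, not taken from `do_work_of_slopes_new_curved`.

```python
import math

def estimate_line_angle(mask):
    """Estimate the angle (degrees) of the foreground in a binary mask,
    given as a list of rows of 0/1, via least-squares regression."""
    points = [(x, y) for y, row in enumerate(mask)
              for x, v in enumerate(row) if v]
    if len(points) < 2:
        return 0.0
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    width = max(xs) - min(xs) + 1
    height = max(ys) - min(ys) + 1
    # if height dominates, regress on the transposed image instead
    transposed = height > width
    if transposed:
        xs, ys = ys, xs
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0:
        return 0.0
    angle = math.degrees(math.atan2(sxy, sxx))
    # assume +90 deg for the transposed case
    return angle + 90.0 if transposed else angle
```

Regressing y on x breaks down for near-vertical lines, where the slope tends to infinity; swapping the axes keeps the fit well-conditioned.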
Robert Sachunsky
97d9b0ea50 small_textlines_to_parent_adherence2: simplify, improve…
- when merging large line with small lines,
  don't use first new contour but largest
- get external contours instead of tree
  (without checking hierarchy afterwards)
- simplify
2026-04-24 15:27:29 +02:00
Robert Sachunsky
0735cb9d2b filter_contours_without_textline_inside: also filter slopes 2026-04-24 15:27:29 +02:00
Robert Sachunsky
fa8340dbb4 -cl: also filter textregions without textlines here 2026-04-24 15:27:29 +02:00
Robert Sachunsky
4a6d3968f9 major run_single refactoring…
- rename `get_regions()` → `get_early_layout()`
- split up `run_boxes_no/full_layout()` into shared
  * `get_full_layout()` (for label mapping,
    table decoding and optional full model prediction)
  * `get_deskewed_masks()` (for de-rotation)
  * extraction of various region types (polygons and confidences)
  * `run_boxes_order()` (for column detection and box ordering)
- rename `contours_tables` → `polygons_of_tables`

This further reduces redundant code, avoids splitting up the same
functionality across different places depending on mode etc.
2026-04-24 15:27:29 +02:00
Robert Sachunsky
dfb40f4a49 hsep fusion: avoid zero division if zero overlap 2026-04-24 15:27:29 +02:00
Robert Sachunsky
b63e073121 skip deskewing if no textlines 2026-04-24 15:27:29 +02:00
Robert Sachunsky
7b5aa2a1f6 more run_single refactoring…
- `run_single`: re-use `return_contours_of_interested_region`
  for extraction and filtering of text region contours
- `run_single`: isolate new function `match_deskewed_contours`
- `run_single`: apply dilation afterwards
- rename `contours_only_text_parent_d_ordered` → `polygons_of_textregions_d`
- rename `contours_only_text_parent` → `polygons_of_textregions`
- rename `contours_only_text_parent_h` → `polygons_of_textregions_h`
- `do_work_of_slopes_new_curved` and `get_slopes_and_deskew_new_curved`:
   no need for `mask_texts_only` array arg
- `filter_contours_inside_a_bigger_one`: no need for `image` as array arg,
  simplify
- `split_textregion_main_vs_head`: simplify, re-order arguments
  and return tuple logically
- if no main text regions are found, just convert marginals to main text
  and continue normally instead of stopping early w/ empty marginals (i.e.
  no textlines)
2026-04-24 15:27:29 +02:00
Robert Sachunsky
a2f43b8d69 simplify, add confidence for headings as well 2026-04-23 21:14:39 +02:00
Robert Sachunsky
264b00f8ab predictor: cache models' input shape instead of output shape 2026-04-23 21:14:39 +02:00