Commit graph

1477 commits

Author SHA1 Message Date
kba
d2aae35446 . 2026-04-28 15:39:53 +02:00
kba
d705f855f1 . 2026-04-28 15:36:50 +02:00
kba
abdcb1a1f9 . 2026-04-28 15:33:57 +02:00
kba
69280187c5 . 2026-04-28 15:29:48 +02:00
kba
1ba82ede88 . 2026-04-28 15:25:36 +02:00
kba
be1296150c . 2026-04-28 15:07:33 +02:00
kba
4899a8fa17 . 2026-04-28 14:59:01 +02:00
kba
29ef9f09dc . 2026-04-28 14:53:13 +02:00
kba
511222704e . 2026-04-28 14:51:23 +02:00
kba
5c6e075975 Merge branch 'ocrd-wrappers' of https://github.com/qurator-spk/eynollah into ocrd-wrappers 2026-04-28 14:31:24 +02:00
kba
1ae862cf52 . 2026-04-28 14:31:15 +02:00
kba
a9e12a63da wp 2026-04-28 12:18:29 +02:00
kba
957dc66e7c organize ocrd-eynollah-segment like ocrd-sbb-binarize 2026-04-27 18:50:54 +02:00
Robert Sachunsky
bb092364af get_slopes_and_deskew_new_light2: estimate slopes here, too…
extract slopes from minimal bounding rectangles of textlines,
using heuristics on aspect ratios, lengths and angles
2026-04-24 15:27:29 +02:00
Robert Sachunsky
c478c03db4 avoid deskewed contour matching w/ -romb 2026-04-24 15:27:29 +02:00
Robert Sachunsky
998ee2ecee get_textlines_of_a_textregion_sorted: simplify 2026-04-24 15:27:29 +02:00
Robert Sachunsky
be61875d6e get_textlines_of_a_textregion_sorted: w-h instead of w/h test 2026-04-24 15:27:29 +02:00
Robert Sachunsky
9723dfeb73 writer: also annotate col-classifier result…
both notations:
- in `/PcGts/Page/@custom` (CSS-style)
- in `/PcGts/Metadata/Comment` (qurator-style)
2026-04-24 15:27:29 +02:00
Robert Sachunsky
e3720d6623 writer: also annotate page-level deskewing result 2026-04-24 15:27:29 +02:00
Robert Sachunsky
2da718f76f writer, do_work_of_slopes*: drop passing bboxes around
(needed no more)
2026-04-24 15:27:29 +02:00
Robert Sachunsky
b792324c5b do_work_of_slopes_new_curved (if angle >45°): simplify, improve…
- use new `rotate_image_enlarge` instead of
  custom (insufficient) padding w/ `rotate_image`
- get external contours instead of tree
  (without checking hierarchy afterwards)
- use largest textline contours by area instead of
  longest polygon path
- always use `separate_lines` (but without its incorrect
  angle/offset calculations) instead of `separate_lines_vertical_cont`
- calculate coordinate transformation (shift, angle)
  for all cases (including >45°)
- simplify
2026-04-24 15:27:29 +02:00
Robert Sachunsky
dbdb6d0d53 rotate: rm unused failed variants, add new rotate_image_enlarge
(correct version that enlarges canvas instead of clipping corners,
 using only OpenCV)
2026-04-24 15:27:29 +02:00
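Note on dbdb6d0d53: the idea of enlarging the canvas instead of clipping corners can be sketched roughly as follows. This is not the code from the commit. The function name `enlarged_rotation` is hypothetical, and the sketch only computes the affine matrix and output size, using the same matrix layout as OpenCV's `cv2.getRotationMatrix2D` with scale 1.

```python
import math


def enlarged_rotation(width, height, angle_deg):
    """Return (matrix, (new_width, new_height)) for rotating a width x height
    image counter-clockwise by angle_deg about its center without clipping.

    Hypothetical helper for illustration; not taken from the repository.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Axis-aligned bounding box of the rotated rectangle
    # (the small epsilon absorbs floating-point noise at multiples of 90 degrees).
    new_w = int(math.ceil(width * abs(cos_t) + height * abs(sin_t) - 1e-9))
    new_h = int(math.ceil(width * abs(sin_t) + height * abs(cos_t) - 1e-9))
    cx, cy = width / 2.0, height / 2.0
    # Rotation about the old center, in image coordinates (origin top-left).
    matrix = [
        [cos_t, sin_t, (1 - cos_t) * cx - sin_t * cy],
        [-sin_t, cos_t, sin_t * cx + (1 - cos_t) * cy],
    ]
    # Shift so that the old center lands on the center of the larger canvas.
    matrix[0][2] += new_w / 2.0 - cx
    matrix[1][2] += new_h / 2.0 - cy
    return matrix, (new_w, new_h)
```

With OpenCV available, the result could presumably be applied via `cv2.warpAffine(image, numpy.float32(matrix), (new_w, new_h))`; whether the commit does it this way is an assumption.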
Robert Sachunsky
d257869d83 do_work_of_slopes_new_curved (if angle <45°): simplify, improve…
- use relative images, cropped to parent bbox (faster)
- no `scale` parameter (unused)
- use largest textline contours by area instead of first
- simplify
2026-04-24 15:27:29 +02:00
Robert Sachunsky
0dce1f24d2 do_work_of_slopes_new_curved: improve deskewing…
- return early if textline mask is empty
- intersect textline mask with parent mask
  (so neighbouring, truncated textlines
   will not interfere)
- fix bug when resulting angle is small:
  rather, compare with page angle
- if there is more than 1 line in the region,
  * use median instead of mean to estimate y_diff
  * if height dominates over width and x_diff
    over y_diff, then assume 90°: transpose image,
    deskew on that, then add 90° to result
- otherwise instead of just using page angle,
  try to estimate single-line angle by approximating
  slope of linear x-y regression on mask image;
  again, if height dominates over width, then
  assume +90° and use transposed image
- drop unused `scale` param
2026-04-24 15:27:29 +02:00
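Note on 0dce1f24d2: the single-line fallback (slope of a linear x-y regression on the mask, with a transposed image when height dominates) might look roughly like this. It is a minimal sketch, not the repository code. The name `single_line_angle`, the plain list-of-rows mask format, and the sign convention (image y axis pointing down) are assumptions.

```python
import math


def single_line_angle(mask):
    """Estimate the angle in degrees of a single text line from a binary
    mask given as rows of 0/1. Returns None if no estimate is possible.

    Hypothetical helper for illustration; not taken from the repository.
    """
    height, width = len(mask), len(mask[0])
    # If height dominates over width, assume a roughly vertical line:
    # regress on the transposed mask and add 90 degrees afterwards.
    transposed = height > width
    if transposed:
        mask = [list(column) for column in zip(*mask)]
    points = [(x, y)
              for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    if len(points) < 2:
        return None
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:
        return None  # all pixels in one column: slope undefined
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # Ordinary least-squares slope of y on x, converted to an angle.
    angle = math.degrees(math.atan(cov_xy / var_x))
    return angle + 90.0 if transposed else angle
```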
Robert Sachunsky
97d9b0ea50 small_textlines_to_parent_adherence2: simplify, improve…
- when merging large line with small lines,
  don't use first new contour but largest
- get external contours instead of tree
  (without checking hierarchy afterwards)
- simplify
2026-04-24 15:27:29 +02:00
Robert Sachunsky
0735cb9d2b filter_contours_without_textline_inside: also filter slopes 2026-04-24 15:27:29 +02:00
Robert Sachunsky
fa8340dbb4 -cl: also filter textregions without textlines here 2026-04-24 15:27:29 +02:00
Robert Sachunsky
4a6d3968f9 major run_single refactoring…
- rename `get_regions()` → `get_early_layout()`
- split up `run_boxes_no/full_layout()` into shared
  * `get_full_layout()` (for label mapping,
    table decoding and optional full model prediction)
  * `get_deskewed_masks()` (for de-rotation)
  * extraction of various region types (polygons and confidences)
  * `run_boxes_order()` (for column detection and box ordering)
- rename `contours_tables` → `polygons_of_tables`

This further reduces redundant code, avoids splitting up the same
functionality across different places depending on mode etc.
2026-04-24 15:27:29 +02:00
Robert Sachunsky
dfb40f4a49 hsep fusion: avoid zero division if zero overlap 2026-04-24 15:27:29 +02:00
Robert Sachunsky
b63e073121 skip deskewing if no textlines 2026-04-24 15:27:29 +02:00
Robert Sachunsky
7b5aa2a1f6 more run_single refactoring…
- `run_single`: re-use `return_contours_of_interested_region`
  for extraction and filtering of text region contours
- `run_single`: isolate new function `match_deskewed_contours`
- `run_single`: apply dilation afterwards
- rename `contours_only_text_parent_d_ordered` → `polygons_of_textregions_d`
- rename `contours_only_text_parent` → `polygons_of_textregions`
- rename `contours_only_text_parent_h` → `polygons_of_textregions_h`
- `do_work_of_slopes_new_curved` and `get_slopes_and_deskew_new_curved`:
   no need for `mask_texts_only` array arg
- `filter_contours_inside_a_bigger_one`: no need for `image` as array arg,
  simplify
- `split_textregion_main_vs_head`: simplify, re-order arguments
  and return tuple logically
- if no main text regions are found, just convert marginals to main text
  and continue normally instead of stopping early w/ empty marginals (i.e.
  no textlines)
2026-04-24 15:27:29 +02:00
Robert Sachunsky
a2f43b8d69 simplify, add confidence for headings as well 2026-04-23 21:14:39 +02:00
Robert Sachunsky
264b00f8ab predictor: cache models' input shape instead of output shape 2026-04-23 21:14:39 +02:00
Robert Sachunsky
829256df91 do_prediction*: remove autosized variants, simplify 2026-04-23 21:14:39 +02:00
Robert Sachunsky
de65a55a04 mbro: simplify, add drop-caps as well, reduce batch size…
- do_order_of_regions_with_model:
  * add `polygons_of_drop_capitals`, order these indices as well
    (model was not trained for this, but it works)
  * explicit label identifiers instead of number literals
  * map marginals and images correctly
  * simplify (a lot)
  * reduce inference batch size to accommodate 8 GB VRAM GPUs
- return_indexes_of_contours_located_inside_another_list_of_contours:
  simplify
2026-04-23 21:14:39 +02:00
Robert Sachunsky
0dfc9d911f run_boxes_no_full_layout: also map to fl labels here…
(because -mbro assumes the label set from -fl)
2026-04-20 18:20:58 +02:00
Robert Sachunsky
0015f2675b with -slro, also extract and apply page (Border) mask 2026-04-20 18:20:58 +02:00
Robert Sachunsky
569b96d1a9 find_number_of_columns_in_document: pass correct label_seps…
- in fl: 6
- non-fl: 3 (now fixed)
2026-04-20 18:20:58 +02:00
Robert Sachunsky
f28a9c9e0b add confidence for all region types, prepare for textlines…
- pass on probabilities from predicted class everywhere
- rename `confidence_matrix` → `confidence_regions` / `regions_confidence`
- rename `get_textregion_confidences()` → `get_region_confidences()`
- add same for tables, textlines and regionsfl (full layout model)
- aggregate per-region confidence lists for image, table, drop-capital,
  left marginal and right marginal regions
- add in writer
- simplify/re-indent some
- try to replace more number literals with class label identifiers
2026-04-20 18:20:58 +02:00
Robert Sachunsky
1164b97917 extract_text_regions_new: fix heading thresholding…
- re-introduce boosting `heading` thresholding broken
  when refactoring (light version and do_prediction)
- also return confidence for full layout prediction
2026-04-20 18:20:58 +02:00
Robert Sachunsky
20dc5c3188 also cover drop-capital in (heuristic) reading order 2026-04-20 18:20:58 +02:00
Robert Sachunsky
92e94753c7 decoding of dropcaps in -fl: ensure consistency w/ early layout…
1. use connected component analysis to get unique segments
   in early prediction result
2. for each drop-capital segment in full prediction result,
   find matching early segment
3. when they have high overlap, assign drop-capital label
   to the entire early segment
2026-04-20 18:20:58 +02:00
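Note on 92e94753c7: the three steps above can be sketched as follows. This is a simplified illustration, not the repository code. The function names, the list-of-rows mask format, 4-connectivity, the overlap measure (fraction of the early segment covered by drop-capital pixels) and the 0.5 threshold are all assumptions.

```python
from collections import deque


def label_components(mask):
    """4-connected component labelling of a binary mask (rows of 0/1).
    Returns (label_map, number_of_components); background is 0."""
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if mask[y][x] and not labels[y][x]:
                count += 1
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:  # breadth-first flood fill
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


def promote_drop_capitals(early_mask, dropcap_mask, min_overlap=0.5):
    """Mark entire early segments as drop-capital when the full-layout
    drop-capital mask covers at least min_overlap of the segment."""
    labels, count = label_components(early_mask)
    area = [0] * (count + 1)
    hits = [0] * (count + 1)
    for y, row in enumerate(labels):
        for x, label in enumerate(row):
            if label:
                area[label] += 1
                if dropcap_mask[y][x]:
                    hits[label] += 1
    chosen = {label for label in range(1, count + 1)
              if hits[label] / area[label] >= min_overlap}
    return [[1 if label in chosen else 0 for label in row] for row in labels]
```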
Robert Sachunsky
29b42fdfaa decoding of drop-capitals in full layout: also allow replacing img…
- rename `putt_bb_of_drop_capitals_of_model_in_patches_in_layout`
  → `fill_bb_of_drop_capitals`
- also allow image (besides text) label in early layout prediction
  result when checking if entire bbox can be filled (as opposed to
  just drop-capital | image | background mask)
- simplify
2026-04-16 18:37:27 +02:00
Robert Sachunsky
6e0aed35f4 run_boxes_*: simplify, document class label mappings, start using
identifier constants instead of literals for labels
2026-04-16 18:37:27 +02:00
Robert Sachunsky
f29e876a7c return_boxes_of_images_by_order_of_reading_new: sep label differs w/o -fl…
fix bug where in non-full mode, the wrong class label was assumed
for separator regions (3 in non- vs 6 in full layout mode):

- pass in separator mask instead of full segmentation map
- rename for clarity:
  - `regions_without_separators` → `text_mask` (already binary)
  - `regions_with_separators` → `sep_mask` (now just binary)
2026-04-16 05:16:23 +02:00
Robert Sachunsky
f5f2435a38 run_marginals: drop unnecessarily passing textline_mask, mask_seps, mask_images 2026-04-16 05:13:06 +02:00
Robert Sachunsky
9309586712 split_textregion_main_vs_header → split_textregion_main_vs_head…
(and simplify)
2026-04-16 05:07:22 +02:00
Robert Sachunsky
0f82b568ba do_prediction_new_concept: aggregate confidence for all classes…
(not just text; will still have to pass that on to the writer...)
2026-04-16 05:02:20 +02:00
Robert Sachunsky
5a27e46b22 keep seps over artificial boundaries to improve col separation…
(thresholding and decoding with artificial boundary class can
 overwrite existing column separators, which in turn can contribute
 to missing column boundaries; this prioritises seps over boundaries,
 which does not impair separation of instances, as seps will separate
 text/image/etc instances just as well as artificial boundaries)
2026-04-16 04:56:38 +02:00
Robert Sachunsky
9d6ff65e1d get_tables_from_model: utilise artificial bound thresholding…
(to improve separation of neighbouring tables, esp. across
 columns; since model's threshold class is particularly weak,
 also use lower threshold here)
2026-04-16 04:49:07 +02:00