Commit graph

1165 commits

Author SHA1 Message Date
Robert Sachunsky
5a778003fd contour matching for deskewed image: ensure matches for both sides 2025-11-15 14:32:22 +01:00
Robert Sachunsky
3c15c4f7d4 back to rotate_image instead of rotation_image_new for deskewing
(because the latter does not preserve coordinates;
 it scales, even when resizing the image;
 this caused coordinate problems when matching deskewed contours)
2025-11-15 14:29:41 +01:00
Robert Sachunsky
4475183f08 improve rules governing column split
- reduce `sigma` for smoothing of input to `find_peaks`
  (so we get deeper gaps between columns)
- allow column boundaries closer to the margins
  (50 instead of 100 or 200 px, 170 instead of 370 px)
- allow column boundaries closer to each other
  (300 instead of 400 px)
- add a secondary `grenze` criterion for depth of gap
  (relative to lowest minimum, if that is smaller than
   the old criterion relative to lowest maximum)
- for calls to `find_num_col` within parts of a page,
  do allow unbalanced column boundaries
2025-11-14 13:15:09 +01:00
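Note (illustrative, not part of the commit): the effect of a smaller `sigma` on gap depth can be seen on a toy projection profile. This is a minimal sketch using a hand-written Gaussian filter instead of the library smoothing and `find_peaks` calls in the code base. The profile, the two `sigma` values and the helper names `gaussian_smooth` and `gap_depth` are assumptions made for the example.

```python
import math

def gaussian_smooth(values, sigma):
    """Smooth a 1-D profile with a truncated, renormalised Gaussian kernel."""
    radius = int(3 * sigma)
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        norm = 0.0
        for k, w in zip(offsets, weights):
            j = i + k
            if 0 <= j < len(values):  # ignore kernel taps outside the profile
                acc += w * values[j]
                norm += w
        smoothed.append(acc / norm)
    return smoothed

def gap_depth(values, sigma):
    """Difference between the highest and lowest value after smoothing."""
    smoothed = gaussian_smooth(values, sigma)
    return max(smoothed) - min(smoothed)

# Toy vertical projection profile: two text columns with a 4 px gutter.
profile = [1.0] * 20 + [0.0] * 4 + [1.0] * 20

depth_small_sigma = gap_depth(profile, sigma=1.0)
depth_large_sigma = gap_depth(profile, sigma=5.0)
print(depth_small_sigma, depth_large_sigma)
```

With the smaller `sigma` the narrow gutter survives as a deep minimum. With the larger one it is mostly smeared out, which makes it harder for a depth criterion to accept it as a column boundary.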
Robert Sachunsky
4abc2ff572 rewrite/simplify manual reading order using recursive algorithm
- rename `return_x_start_end_mothers_childs_and_type_of_reading_order`
  → `return_multicol_separators_x_start_end`, and drop all the analysis
  pertaining to mother/child relationships and full-span separators,
  also drop the separator unification rules;
  instead of the latter, try to combine neighbouring separators more
  generally: join column spans iff there is nothing in between
  (which also necessitates passing the region mask), and keep only
  one of every such redundant pair;
  add the top (of each page part) as full-span separator up front,
  and return separators already ordered by y
- `return_boxes_of_images_by_order_of_reading_new`:
  - also pass regions with separators, so they do not have to be
    reconstructed from the separator coordinates, and also contain
    images and other non-text region types, when trying to elongate
    separators to maximize their span (without introducing overlaps)
  - determine connected components of the region mask, i.e. labels
    and their respective bboxes, in order to
    1. gain additional multi-column separators, if possible
    2. avoid cutting through regions which do cross column boundaries
       later on
  - whenever adding a new bbox, first look up the label map to see if
    there are any multi-column regions extending to the right of the
    current column; if there are, then advance not just one column
    to the right, but as many as necessary to avoid cutting through
    these regions
  - new core algorithm: iterate separators sorted by y and then column
    by column, but whenever the next separator ends in the same column
    as the current one or even further left, recurse (i.e. finish that
    span first before continuing with the top iteration)
2025-11-14 13:14:53 +01:00
Robert Sachunsky
95f76081d1 rename some more identifiers:
- `lines` → `seps` (to distinguish from textlines)
- `text_regions_p_1_n` → `text_regions_p_d` (because all other
  deskewed variables are called like this)
- `pixel` → `label`
2025-11-14 13:13:50 +01:00
Robert Sachunsky
1a76ce177d do_order_of_regions: round contour centers
(so we can be sure they do not fall through the
 "pixel cracks": bboxes are delimited by integers,
 and we do not want to assign contours between
 boxes)
2025-11-14 13:08:10 +01:00
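Note (illustrative, not part of the commit): a minimal sketch of the "pixel crack" problem, assuming boxes are stored as inclusive integer ranges. The box list and the helper `find_box` are hypothetical and only show why a fractional centre can fail to match any box.

```python
# Hypothetical bounding boxes as inclusive integer x-ranges (x_min, x_max).
boxes = [(0, 9), (10, 19)]

def find_box(cx, boxes):
    """Return the index of the box whose inclusive range contains cx, else None."""
    for index, (x_min, x_max) in enumerate(boxes):
        if x_min <= cx <= x_max:
            return index
    return None

center = 9.6  # e.g. a contour centroid, which is generally not an integer

unrounded = find_box(center, boxes)        # falls between the two boxes
rounded = find_box(round(center), boxes)   # lands inside the second box
print(unrounded, rounded)
```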
Robert Sachunsky
19b2c3fa42 reading order: improve handling of headings and horizontal seps
- drop connected components analysis to test overlaps between
  horizontal separators and (horizontal) neighbours (introduced
  in ab17a927)
- instead of converting headings to topline and baseline during
  `find_number_of_columns_in_document` (introduced in 9f1595d7),
  add them to the matrix unchanged, but mark as extra type
  (besides horizontal and vertical separators)
- convert headings to toplines and baselines no earlier than in
  `return_boxes_of_images_by_order_of_reading_new`
- for both headings and horizontal separators, if they already
  span multiple columns, check if they would overlap (horizontal)
  neighbours by looking at successively larger (left and right)
  intervals of columns (and pick the largest elongation which
  does not introduce any overlaps)
2025-10-25 13:36:35 +02:00
Robert Sachunsky
3367462d18 return_boxes_of_images_by_order_of_reading_new: change arg order 2025-10-25 13:36:24 +02:00
Robert Sachunsky
a2a9fe5117 delete_separator_around: simplify, eynollah: identifiers
- use array instead of list operations
- rename identifiers:
  - `pixel` → `label`
  - `line` → `sep`
2025-10-25 13:36:17 +02:00
Robert Sachunsky
3ebbc2d693 return_boxes_of_images_by_order_of_reading_new: indent
(by removing unnecessary conditional)
2025-10-25 13:36:06 +02:00
Robert Sachunsky
66a0e55e49 return_boxes_of_images_by_order_of_reading_new: avoid oversplits
when y slice (`top:bot`) is not a significant part of the page,
viz. less than 22% (as in `find_number_of_columns_in_document`),
avoid forcing `find_num_col` to reach `num_col_classifier`

(allows large headers not to be split up and thus better ordered)
2025-10-25 13:35:56 +02:00
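Note (illustrative, not part of the commit): the 22% rule can be read as a simple guard on the height of the y slice. The sketch below assumes the threshold is applied to the slice height relative to the full page height. The function name and the example numbers are hypothetical.

```python
def is_significant_slice(top, bot, page_height, min_fraction=0.22):
    """True if the y slice top:bot covers at least min_fraction of the page."""
    return (bot - top) >= min_fraction * page_height

page_height = 1000
header = (0, 150)    # a large heading at the top of the page
body = (150, 1000)   # the rest of the page

# Only force the column count of the page classifier on significant slices.
force_header = is_significant_slice(*header, page_height)
force_body = is_significant_slice(*body, page_height)
print(force_header, force_body)
```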
Robert Sachunsky
6fbb5f8a12 return_boxes_of_images_by_order_of_reading_new: simplify
- array instead of list operations
- add better plotting (but commented out)
- add more debug printing (but commented out)
- add more inline comments for documentation
- rename identifiers to make more readable:
  - `cy_hor_diff` → `y_max_hor_some` (because the ymax gets passed)
  - `lines` → `seps`
  - `y_type_2` → `y_mid`
  - `y_diff_type_2` → `y_max`
  - `y_lines_by_order` → `y_mid_by_order`
  - `y_lines_without_mother` → `y_mid_without_mother`
  - `y_lines_with_child_without_mother` → `y_mid_with_child_without_mother`
  - `y_column` → `y_mid_column`
  - `y_column_nc` → `y_mid_column_nc`
  - `y_all_between_nm_wc` → `y_mid_between_nm_wc`
  - `lines_so_close_to_top_separator` → `seps_too_close_to_top_separator`
  - `y_in_cols` and `y_down` → `y_mid_next`
- use `pairwise()` `nc_top:nc_bot` instead of `i_c` indexing
2025-10-25 13:35:44 +02:00
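Note (illustrative, not part of the commit): a minimal sketch of replacing index arithmetic with `pairwise()`. It assumes the function meant is `itertools.pairwise` (Python 3.10+), with a fallback for older interpreters. The boundary list is made up, and the names `nc_top` and `nc_bot` are borrowed from the commit message only as labels.

```python
try:
    from itertools import pairwise  # available from Python 3.10
except ImportError:
    def pairwise(iterable):
        """Fallback: yield consecutive overlapping pairs."""
        items = list(iterable)
        return zip(items, items[1:])

boundaries = [0, 120, 480, 900]  # hypothetical, already sorted

# Index-based variant (the `i_c` style).
slices_by_index = []
for i_c in range(len(boundaries) - 1):
    slices_by_index.append((boundaries[i_c], boundaries[i_c + 1]))

# Pairwise variant: each step yields one `nc_top:nc_bot` interval.
slices_pairwise = [(nc_top, nc_bot) for nc_top, nc_bot in pairwise(boundaries)]
print(slices_pairwise)
```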
Robert Sachunsky
6cc5900943 find_num_col: add better plotting (but commented out) 2025-10-25 13:35:34 +02:00
Robert Sachunsky
5d15941b35 contours_in_same_horizon: simplify
- array instead of list operations
- return array of index pairs instead of list objects
2025-10-25 13:35:26 +02:00
Robert Sachunsky
acee4c1bfe find_number_of_columns_in_document: simplify 2025-10-25 13:35:18 +02:00
Robert Sachunsky
b2a79cc6ed return_x_start_end_mothers_childs_and_type_of_reading_order: fix+1
when calculating `reading_order_type`, upper limit on column range
(`x_end`) needs to be `+1` here as well
2025-10-25 13:35:12 +02:00
Robert Sachunsky
e2dfec75fb return_x_start_end_mothers_childs_and_type_of_reading_order:
simplify and document

- simplify
- rename identifiers to make readable:
  - `y_sep` → `y_mid` (because the cy gets passed)
  - `y_diff` → `y_max` (because the ymax gets passed)
- array instead of list operations
- add docstring and in-line comments
- return (zero-length) numpy array instead of empty list
2025-10-25 13:35:06 +02:00
Robert Sachunsky
0fc4b2535d return_boxes_of_images_by_order_of_reading_new: fix no-mother case
- when handling lines without mother,
  and biggest line already accounts for all columns,
  but some are too close to the top and therefore must be removed,
  avoid invalidating `biggest` index, causing `IndexError`
- remove try-catch (now unnecessary)
- array instead of list operations
2025-10-25 13:34:58 +02:00
Robert Sachunsky
7c3e418588 return_boxes_of_images_by_order_of_reading_new: simplify
- enumeration instead of indexing
- array instead of list operations
- add better plotting (but commented out)
2025-10-25 13:34:52 +02:00
Robert Sachunsky
cd35241e81 find_number_of_columns_in_document: split headings at top+baseline
regarding `splitter_y` result, for headings, instead of cutting right
through them via center line, add their toplines and baselines as if
they were horizontal separators
2025-10-25 13:34:35 +02:00
Robert Sachunsky
5a0e4c3b0f find_number_of_columns_in_document: improve splitter rule
extend horizontal separators to full img width if they do not overlap
any other regions

(only as regards the returned `splitter_y` result,
 but without changing returned separators mask)
2025-10-20 17:41:50 +02:00
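Note (illustrative, not part of the commit): a minimal sketch of the splitter rule, assuming a binary region mask stored as nested lists and separators given as `(y0, y1, x0, x1)` with exclusive upper bounds. The helper `extend_if_free` and the toy mask are hypothetical.

```python
def extend_if_free(region_mask, sep):
    """Extend a horizontal separator to full width unless a region is in the way."""
    y0, y1, x0, x1 = sep
    width = len(region_mask[0])
    for row in region_mask[y0:y1]:
        # Any region pixel left or right of the separator blocks the extension.
        if any(row[:x0]) or any(row[x1:]):
            return sep
    return (y0, y1, 0, width)

# 6 rows x 10 columns, with one region in the top-left corner.
mask = [[0] * 10 for _ in range(6)]
for y in range(0, 2):
    for x in range(0, 4):
        mask[y][x] = 1

free_sep = extend_if_free(mask, (3, 4, 2, 6))     # nothing beside it
blocked_sep = extend_if_free(mask, (1, 2, 5, 8))  # region to its left
print(free_sep, blocked_sep)
```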
Robert Sachunsky
542d38ab43 find_number_of_columns_in_document: simplify, rename lineseps 2025-10-20 17:41:49 +02:00
Robert Sachunsky
d3d599b010 order_of_regions: add better plotting (but commented out) 2025-10-20 17:41:47 +02:00
Robert Sachunsky
c43a825d1d order_of_regions: filter out-of-image peaks 2025-10-20 17:41:47 +02:00
Robert Sachunsky
48761c3e12 find_num_col: simplify, add better plotting (but commented out) 2025-10-20 17:41:45 +02:00
Robert Sachunsky
184927fb54 find_num_col: re-sort peaks when cutting n-best num_col_classifier 2025-10-20 17:41:44 +02:00
Robert Sachunsky
086c1880ac binarization: add option --overwrite, skip existing outputs
(also, simplify `run` and separate `run_single`)
2025-10-20 17:40:52 +02:00
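Note (illustrative, not part of the commit): a minimal sketch of skipping existing outputs unless overwriting is requested. The names `run` and `run_single` come from the commit message, but their signatures here are assumptions, and `run_single` is a stand-in that copies bytes instead of binarizing.

```python
import tempfile
from pathlib import Path

def run_single(input_path, output_path):
    """Stand-in for processing one image (here: copy the bytes)."""
    Path(output_path).write_bytes(Path(input_path).read_bytes())

def run(input_dir, output_dir, overwrite=False):
    """Process every file in input_dir and return the names actually processed."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for input_path in sorted(Path(input_dir).iterdir()):
        output_path = output_dir / input_path.name
        if output_path.exists() and not overwrite:
            continue  # skip existing outputs
        run_single(input_path, output_path)
        processed.append(input_path.name)
    return processed

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in"
    dst = Path(tmp) / "out"
    src.mkdir()
    (src / "a.png").write_bytes(b"a")
    (src / "b.png").write_bytes(b"b")

    first = run(src, dst)                   # nothing exists yet
    second = run(src, dst)                  # everything is skipped
    third = run(src, dst, overwrite=True)   # everything is redone

print(first, second, third)
```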
kba
38c028c6b5 📦 v0.6.0 2025-10-17 10:36:30 +02:00
kba
ca8edb35e3 📝 changelog 2025-10-17 10:35:13 +02:00
kba
50e8b2c266 Merge branch 'integrate-training-from-sbb_pixelwise_segmentation' 2025-10-17 10:33:04 +02:00
kba
46d25647f7 📝 changelog 2025-10-17 10:32:15 +02:00
Robert Sachunsky
2ac01ecacc join_polygons: try to catch rare case of MultiPolygon 2025-10-17 10:31:51 +02:00
kba
2e0fb64dcb disable ruff check for training code for now 2025-10-16 21:29:37 +02:00
kba
76c13bcfd7 Merge branch 'integrate-training-from-sbb_pixelwise_segmentation' of https://github.com/qurator-spk/eynollah into integrate-training-from-sbb_pixelwise_segmentation 2025-10-16 20:50:24 +02:00
kba
af5abb77fd Merge branch 'main' into integrate-training-from-sbb_pixelwise_segmentation 2025-10-16 20:50:16 +02:00
kba
d2f0a43088 📝 changelog 2025-10-16 20:46:49 +02:00
Konstantin Baierer
3bd3faef68 Merge pull request #193 from qurator-spk/training-installation
Training installation
2025-10-16 20:39:17 +02:00
kba
1e66c85222 Merge branch 'integrate-training-from-sbb_pixelwise_segmentation' into training-installation 2025-10-16 16:18:02 +02:00
kba
bd8c8bfeac training: pin numpy to <1.24 as well 2025-10-16 16:15:31 +02:00
Robert Sachunsky
948c8c3441 join_polygons: try to catch rare case of MultiPolygon 2025-10-15 16:58:17 +02:00
kba
f485dd4181 📦 v0.6.0rc2 2025-10-14 16:10:50 +02:00
kba
c1f0158806 📝 changelog 2025-10-14 14:53:15 +02:00
kba
7daa0a1bd5 Merge branch 'fix-196' into prepare-v0.6.0rc2 2025-10-14 14:52:36 +02:00
kba
2febf53479 📝 changelog 2025-10-14 14:52:31 +02:00
Robert Sachunsky
8299e7009a setup_models: avoid unnecessarily loading region_fl 2025-10-14 14:27:32 +02:00
Robert Sachunsky
e8b7212f36 polygon2contour: avoid uint for coords
(introduced in a433c736 to make consistent with
 `filter_contours_area_of_image`, but actually
 np.uint is prone to create overflows downstream)
2025-10-14 14:27:26 +02:00
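Note (illustrative, not part of the commit): a minimal sketch of the kind of overflow meant here. Unsigned integer arithmetic wraps around instead of going negative. To stay dependency-free, the wrap-around is emulated with `ctypes.c_uint64` instead of NumPy. The coordinate and margin values and the helper `as_uint64` are assumptions, and the exact downstream failure is not described in the commit.

```python
import ctypes

def as_uint64(value):
    """Reinterpret a Python int as a 64-bit unsigned integer (wraps modulo 2**64)."""
    return ctypes.c_uint64(value).value

x_coord = 5
margin = 10

signed = x_coord - margin              # what a signed dtype would hold
wrapped = as_uint64(x_coord - margin)  # what an unsigned 64-bit value holds
print(signed, wrapped)
```

A coordinate that should be slightly negative becomes a huge positive number, which silently corrupts any later comparison or clipping step.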
kba
745cf3be48 XML encoding should be utf-8 not utf8
... and should use OCR-D's generateDS PAGE API consistently
2025-10-10 16:39:17 +02:00
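Note (illustrative, not part of the commit): a minimal sketch of how the spelling of the encoding name ends up in the XML declaration. It uses the standard library's `xml.etree.ElementTree` as a stand-in, not the generateDS PAGE API mentioned in the commit, and the element name is made up.

```python
import xml.etree.ElementTree as ET

root = ET.Element("PcGts")  # made-up minimal document

# The encoding name passed in is copied verbatim into the declaration.
doc_alias = ET.tostring(root, encoding="utf8")
doc_canonical = ET.tostring(root, encoding="utf-8", xml_declaration=True)

print(doc_alias.splitlines()[0])
print(doc_canonical.splitlines()[0])
```

Python accepts both spellings as codec names, but only the hyphenated one is written into the file as `utf-8`.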
kba
2056a8bdb9 📦 v0.6.0rc1 2025-10-10 16:32:47 +02:00
Robert Sachunsky
4e9a1618c3 layout: refactor model setup, allow loading custom versions
- simplify definition of (defaults for) model versions
- unify loading of loadable models (depending on mode)
- use `self.models` dict instead of `self.model_*` attributes
- add `model_versions` kwarg / `--model_version` CLI option
2025-10-10 03:18:09 +02:00
Robert Sachunsky
374818de11 📝 update changelog for 5725e4f 2025-10-09 23:11:05 +02:00