1156 commits

Author SHA1 Message Date
Robert Sachunsky
3ebbc2d693 return_boxes_of_images_by_order_of_reading_new: indent
(by removing unnecessary conditional)
2025-10-25 13:36:06 +02:00
Robert Sachunsky
66a0e55e49 return_boxes_of_images_by_order_of_reading_new: avoid oversplits
when y slice (`top:bot`) is not a significant part of the page,
viz. less than 22% (as in `find_number_of_columns_in_document`),
avoid forcing `find_num_col` to reach `num_col_classifier`

(allows large headers not to be split up and thus better ordered)
2025-10-25 13:35:56 +02:00
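
The significance test described in the commit above could look roughly like this. It is a minimal sketch, not the project's code: `is_significant_slice` and its parameters are hypothetical names, and only the 22% figure is taken from the commit message.

```python
def is_significant_slice(top, bot, page_height, min_fraction=0.22):
    """Return True if the y slice top:bot covers at least min_fraction of the page.

    Hypothetical helper; only the 22% default comes from the commit message.
    """
    if page_height <= 0:
        raise ValueError("page_height must be positive")
    return (bot - top) / page_height >= min_fraction


# A 300 px header on a 3000 px page covers 10% of the height, so a caller
# following the commit's rule would not force the classifier's column count on it.
print(is_significant_slice(0, 300, 3000))     # False
print(is_significant_slice(300, 3000, 3000))  # True
```
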
Robert Sachunsky
6fbb5f8a12 return_boxes_of_images_by_order_of_reading_new: simplify
- array instead of list operations
- add better plotting (but commented out)
- add more debug printing (but commented out)
- add more inline comments for documentation
- rename identifiers to make more readable:
  - `cy_hor_diff` → `y_max_hor_some` (because the ymax gets passed)
  - `lines` → `seps`
  - `y_type_2` → `y_mid`
  - `y_diff_type_2` → `y_max`
  - `y_lines_by_order` → `y_mid_by_order`
  - `y_lines_without_mother` → `y_mid_without_mother`
  - `y_lines_with_child_without_mother` → `y_mid_with_child_without_mother`
  - `y_column` → `y_mid_column`
  - `y_column_nc` → `y_mid_column_nc`
  - `y_all_between_nm_wc` → `y_mid_between_nm_wc`
  - `lines_so_close_to_top_separator` → `seps_too_close_to_top_separator`
  - `y_in_cols` and `y_down` → `y_mid_next`
- use `pairwise()` `nc_top:nc_bot` instead of `i_c` indexing
2025-10-25 13:35:44 +02:00
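
The last item (iterating over `pairwise()` results instead of indexing) can be illustrated as below. This is a sketch under assumptions: the boundary values are made up, `consecutive_pairs` is a hypothetical helper, and only the `nc_top`/`nc_bot` names come from the commit message.

```python
try:
    from itertools import pairwise  # available from Python 3.10
except ImportError:
    def pairwise(iterable):
        """Fallback for older Python versions."""
        items = list(iterable)
        return zip(items, items[1:])


def consecutive_pairs(boundaries):
    """Return (start, end) pairs for consecutive entries of boundaries."""
    return list(pairwise(boundaries))


# Hypothetical boundaries; each pair delimits one slice `nc_top:nc_bot`,
# so no running index is needed to look up the next boundary.
for nc_top, nc_bot in consecutive_pairs([0, 2, 3, 5]):
    print(f"{nc_top}:{nc_bot}")
```
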
Robert Sachunsky
6cc5900943 find_num_col: add better plotting (but commented out) 2025-10-25 13:35:34 +02:00
Robert Sachunsky
5d15941b35 contours_in_same_horizon: simplify
- array instead of list operations
- return array of index pairs instead of list objects
2025-10-25 13:35:26 +02:00
Robert Sachunsky
acee4c1bfe find_number_of_columns_in_document: simplify 2025-10-25 13:35:18 +02:00
Robert Sachunsky
b2a79cc6ed return_x_start_end_mothers_childs_and_type_of_reading_order: fix+1
when calculating `reading_order_type`, upper limit on column range
(`x_end`) needs to be `+1` here as well
2025-10-25 13:35:12 +02:00
Robert Sachunsky
e2dfec75fb return_x_start_end_mothers_childs_and_type_of_reading_order: simplify and document

- simplify
- rename identifiers to make readable:
  - `y_sep` → `y_mid` (because the cy gets passed)
  - `y_diff` → `y_max` (because the ymax gets passed)
- array instead of list operations
- add docstring and in-line comments
- return (zero-length) numpy array instead of empty list
2025-10-25 13:35:06 +02:00
Robert Sachunsky
0fc4b2535d return_boxes_of_images_by_order_of_reading_new: fix no-mother case
- when handling lines without mother,
  and biggest line already accounts for all columns,
  but some are too close to the top and therefore must be removed,
  avoid invalidating `biggest` index, causing `IndexError`
- remove try-catch (now unnecessary)
- array instead of list operations
2025-10-25 13:34:58 +02:00
Robert Sachunsky
7c3e418588 return_boxes_of_images_by_order_of_reading_new: simplify
- enumeration instead of indexing
- array instead of list operations
- add better plotting (but commented out)
2025-10-25 13:34:52 +02:00
Robert Sachunsky
cd35241e81 find_number_of_columns_in_document: split headings at top+baseline
regarding `splitter_y` result, for headings, instead of cutting right
through them via center line, add their toplines and baselines as if
they were horizontal separators
2025-10-25 13:34:35 +02:00
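
A rough sketch of the idea in the commit above: headings contribute their top line and baseline as split positions, rather than their center line. `collect_splitters` and its arguments are hypothetical, and including 0 and the page height among the splitters is an assumption made for this illustration.

```python
def collect_splitters(separator_ys, headings, page_height):
    """Collect sorted, unique y positions at which the page is split.

    separator_ys: y positions of horizontal separators
    headings: (y_top, y_bottom) pairs, one per heading
    """
    ys = {0, page_height}  # assumption: page edges always count as splitters
    ys.update(separator_ys)
    for y_top, y_bottom in headings:
        # Top line and baseline instead of the center (y_top + y_bottom) // 2,
        # so the heading lies inside one slice rather than being cut in two.
        ys.update((y_top, y_bottom))
    return sorted(ys)


print(collect_splitters([1200], [(100, 260)], 3000))
# [0, 100, 260, 1200, 3000]
```
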
Robert Sachunsky
5a0e4c3b0f find_number_of_columns_in_document: improve splitter rule
extend horizontal separators to full img width if they do not overlap
any other regions

(this only affects the returned `splitter_y` result;
 the returned separators mask is unchanged)
2025-10-20 17:41:50 +02:00
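
One possible reading of the splitter rule above, as a sketch only: a separator is widened to the full image width when the resulting band would not vertically overlap any other region. The box layout `(x_min, x_max, y_min, y_max)` and the function name are assumptions, not the project's data structures.

```python
def extend_separator(sep, regions, img_width):
    """Widen a horizontal separator to the full image width if nothing is in the way.

    sep and each entry of regions are (x_min, x_max, y_min, y_max) boxes
    (hypothetical layout chosen for this example).
    """
    _, _, y_min, y_max = sep
    for _, _, ry_min, ry_max in regions:
        # Any region sharing part of the separator's y range blocks the extension.
        if ry_min < y_max and ry_max > y_min:
            return sep
    return (0, img_width, y_min, y_max)


sep = (200, 800, 500, 510)
print(extend_separator(sep, [(0, 150, 600, 900)], 1000))  # (0, 1000, 500, 510)
print(extend_separator(sep, [(0, 150, 400, 505)], 1000))  # (200, 800, 500, 510)
```
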
Robert Sachunsky
542d38ab43 find_number_of_columns_in_document: simplify, rename lineseps 2025-10-20 17:41:49 +02:00
Robert Sachunsky
d3d599b010 order_of_regions: add better plotting (but commented out) 2025-10-20 17:41:47 +02:00
Robert Sachunsky
c43a825d1d order_of_regions: filter out-of-image peaks 2025-10-20 17:41:47 +02:00
Robert Sachunsky
48761c3e12 find_num_col: simplify, add better plotting (but commented out) 2025-10-20 17:41:45 +02:00
Robert Sachunsky
184927fb54 find_num_cols: re-sort peaks when cutting n-best num_col_classifier 2025-10-20 17:41:44 +02:00
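
Why re-sorting matters when keeping only the n best peaks: selecting by score leaves the survivors in score order, not position order. The sketch below assumes that a higher score means a better peak; `n_best_peaks` and the sample values are hypothetical.

```python
def n_best_peaks(positions, scores, n):
    """Keep the n highest-scoring peaks, returned in ascending position order."""
    ranked = sorted(zip(positions, scores), key=lambda ps: ps[1], reverse=True)
    best = ranked[:n]
    # Without this final sort the result would follow score order,
    # e.g. [700, 100] instead of [100, 700] for the call below.
    return sorted(pos for pos, _ in best)


print(n_best_peaks([100, 400, 250, 700], [0.9, 0.2, 0.8, 0.95], 2))  # [100, 700]
```
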
Robert Sachunsky
086c1880ac binarization: add option --overwrite, skip existing outputs
(also, simplify `run` and separate `run_single`)
2025-10-20 17:40:52 +02:00
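
The skip-unless-overwrite behavior can be sketched as a small predicate. This is not the tool's actual implementation: `should_write` is a hypothetical helper, and only the `--overwrite` option name comes from the commit message.

```python
import tempfile
from pathlib import Path


def should_write(output_path, overwrite=False):
    """Return True if output_path may be written: it is absent, or overwrite is set."""
    return overwrite or not Path(output_path).exists()


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "page1.png"
    existing.write_bytes(b"")
    print(should_write(existing))                  # False: skipped by default
    print(should_write(existing, overwrite=True))  # True
    print(should_write(Path(tmp) / "page2.png"))   # True: nothing to overwrite
```
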
kba
38c028c6b5 📦 v0.6.0 2025-10-17 10:36:30 +02:00
kba
ca8edb35e3 📝 changelog 2025-10-17 10:35:13 +02:00
kba
50e8b2c266 Merge branch 'integrate-training-from-sbb_pixelwise_segmentation' 2025-10-17 10:33:04 +02:00
kba
46d25647f7 📝 changelog 2025-10-17 10:32:15 +02:00
Robert Sachunsky
2ac01ecacc join_polygons: try to catch rare case of MultiPolygon 2025-10-17 10:31:51 +02:00
kba
2e0fb64dcb disable ruff check for training code for now 2025-10-16 21:29:37 +02:00
kba
76c13bcfd7 Merge branch 'integrate-training-from-sbb_pixelwise_segmentation' of https://github.com/qurator-spk/eynollah into integrate-training-from-sbb_pixelwise_segmentation 2025-10-16 20:50:24 +02:00
kba
af5abb77fd Merge branch 'main' into integrate-training-from-sbb_pixelwise_segmentation 2025-10-16 20:50:16 +02:00
kba
d2f0a43088 📝 changelog 2025-10-16 20:46:49 +02:00
Konstantin Baierer
3bd3faef68 Merge pull request #193 from qurator-spk/training-installation
Training installation
2025-10-16 20:39:17 +02:00
kba
1e66c85222 Merge branch 'integrate-training-from-sbb_pixelwise_segmentation' into training-installation 2025-10-16 16:18:02 +02:00
kba
bd8c8bfeac training: pin numpy to <1.24 as well 2025-10-16 16:15:31 +02:00
Robert Sachunsky
948c8c3441 join_polygons: try to catch rare case of MultiPolygon 2025-10-15 16:58:17 +02:00
kba
f485dd4181 📦 v0.6.0rc2 2025-10-14 16:10:50 +02:00
kba
c1f0158806 📝 changelog 2025-10-14 14:53:15 +02:00
kba
7daa0a1bd5 Merge branch 'fix-196' into prepare-v0.6.0rc2 2025-10-14 14:52:36 +02:00
kba
2febf53479 📝 changelog 2025-10-14 14:52:31 +02:00
Robert Sachunsky
8299e7009a setup_models: avoid unnecessarily loading region_fl 2025-10-14 14:27:32 +02:00
Robert Sachunsky
e8b7212f36 polygon2contour: avoid uint for coords
(introduced in a433c736 to make consistent with
 `filter_contours_area_of_image`, but actually
 np.uint is prone to create overflows downstream)
2025-10-14 14:27:26 +02:00
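
The overflow risk mentioned above can be shown without NumPy by emulating fixed-width unsigned arithmetic in plain Python. The 32-bit width is an assumption for this example; subtracting one unsigned NumPy array from another wraps around in the same manner.

```python
def unsigned_sub(a, b, bits=32):
    """Subtract b from a the way a fixed-width unsigned integer would (modulo 2**bits)."""
    return (a - b) % (1 << bits)


# A coordinate difference that should be negative wraps to a huge positive value.
print(unsigned_sub(5, 10))  # 4294967291, not -5
print(unsigned_sub(10, 5))  # 5
```
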
kba
745cf3be48 XML encoding should be utf-8 not utf8
... and should use OCR-D's generateDS PAGE API consistently
2025-10-10 16:39:17 +02:00
kba
2056a8bdb9 📦 v0.6.0rc1 2025-10-10 16:32:47 +02:00
Robert Sachunsky
4e9a1618c3 layout: refactor model setup, allow loading custom versions
- simplify definition of (defaults for) model versions
- unify loading of loadable models (depending on mode)
- use `self.models` dict instead of `self.model_*` attributes
- add `model_versions` kwarg / `--model_version` CLI option
2025-10-10 03:18:09 +02:00
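
A simplified picture of the refactoring described above: default versions per model category, overridable through a `model_versions` argument, with loaded models kept in a single `self.models` dict. The category names, version strings, `fake_loader`, and the `Layout` class are all hypothetical; only `self.models` and `model_versions` are named in the commit message.

```python
# Hypothetical categories and version names, for illustration only.
DEFAULT_VERSIONS = {"binarization": "2021", "region": "2024"}


def fake_loader(category, version):
    """Stand-in for real model loading; returns a label instead of a model."""
    return f"{category}@{version}"


class Layout:
    def __init__(self, model_versions=None, loader=fake_loader):
        # Start from the defaults, then apply per-category overrides.
        self.versions = dict(DEFAULT_VERSIONS)
        for category, version in (model_versions or {}).items():
            if category not in self.versions:
                raise ValueError(f"unknown model category: {category}")
            self.versions[category] = version
        # One dict keyed by category replaces separate model_* attributes.
        self.models = {
            category: loader(category, version)
            for category, version in self.versions.items()
        }


layout = Layout(model_versions={"region": "custom"})
print(layout.models["region"])        # region@custom
print(layout.models["binarization"])  # binarization@2021
```
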
Robert Sachunsky
374818de11 📝 update changelog for 5725e4f 2025-10-09 23:11:05 +02:00
Robert Sachunsky
c4cb16c2a8 simplify
(`skip_layout_and_reading_order` is already an attr)
2025-10-09 23:05:50 +02:00
Robert Sachunsky
ecb53056f2 Merge branch 'main' of https://github.com/qurator-spk/eynollah into loky-with-shm-for-175-rebuilt 2025-10-09 22:54:11 +02:00
Robert Sachunsky
d96af425a7 Merge pull request #4 from bertsky/loky-with-shm-for-175-rebuilt-refactored
refactoring for 192: speedup and improvements
2025-10-09 22:18:53 +02:00
Robert Sachunsky
cab392601e 📝 update changelog 2025-10-09 20:14:11 +02:00
Robert Sachunsky
e1b56d97da CI: lint with ruff 2025-10-09 20:14:11 +02:00
Robert Sachunsky
a144026b27 add rough ruff config 2025-10-09 20:14:11 +02:00
Robert Sachunsky
b3d29bef89 return_contours_of_interested_region*: rm unused variants 2025-10-09 20:14:11 +02:00
Robert Sachunsky
8a2d682e12 fix identifier scope in layout OCR options (w/o full_layout) 2025-10-09 20:14:11 +02:00
Robert Sachunsky
096def1e9d mbreorder/enhancement: fix missing imports
(not sure if these models really need that, though)
2025-10-09 20:14:11 +02:00