Robert Sachunsky
cf5a0bacd2
Merge 19b2c3fa42 into 38c028c6b5
2025-10-25 13:36:48 +02:00
Robert Sachunsky
19b2c3fa42
reading order: improve handling of headings and horizontal seps
...
- drop connected components analysis to test overlaps between
horizontal separators and (horizontal) neighbours (introduced
in ab17a927)
- instead of converting headings to topline and baseline during
`find_number_of_columns_in_document` (introduced in 9f1595d7),
add them to the matrix unchanged, but mark as extra type
(besides horizontal and vertical separators)
- convert headings to toplines and baselines no earlier than in
`return_boxes_of_images_by_order_of_reading_new`
- for both headings and horizontal separators, if they already
span multiple columns, check if they would overlap (horizontal)
neighbours by looking at successively larger (left and right)
intervals of columns (and pick the largest elongation which
does not introduce any overlaps)
2025-10-25 13:36:35 +02:00
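Note on 19b2c3fa42: the last bullet (widening a heading or horizontal separator that already spans multiple columns) can be pictured with the minimal sketch below. It is not the code from the commit. The function name `widen_span` and the per-column `occupied` flags are assumptions made for illustration only; the actual implementation works on the separator matrix and may differ in detail.

```python
def widen_span(start, end, occupied):
    """Widen the half-open column span [start, end) as far as possible.

    `occupied[c]` is True if column `c` holds a horizontal neighbour that
    the elongated separator or heading would overlap (hypothetical input).
    """
    # Try successively larger intervals to the left.
    while start > 0 and not occupied[start - 1]:
        start -= 1
    # Then try successively larger intervals to the right.
    while end < len(occupied) and not occupied[end]:
        end += 1
    # The result is the largest elongation that introduces no overlap.
    return start, end


# Five columns: the separator covers columns 1..2, column 4 is blocked.
print(widen_span(1, 3, [False, False, False, False, True]))
```

Widening stops at the first blocked column on each side, so the returned span never crosses a neighbour.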
Robert Sachunsky
3367462d18
return_boxes_of_images_by_order_of_reading_new: change arg order
2025-10-25 13:36:24 +02:00
Robert Sachunsky
a2a9fe5117
delete_separator_around: simplify, eynollah: identifiers
...
- use array instead of list operations
- rename identifiers:
- `pixel` → `label`
- `line` → `sep`
2025-10-25 13:36:17 +02:00
Robert Sachunsky
3ebbc2d693
return_boxes_of_images_by_order_of_reading_new: indent
...
(by removing unnecessary conditional)
2025-10-25 13:36:06 +02:00
Robert Sachunsky
66a0e55e49
return_boxes_of_images_by_order_of_reading_new: avoid oversplits
...
when y slice (`top:bot`) is not a significant part of the page,
viz. less than 22% (as in `find_number_of_columns_in_document`),
avoid forcing `find_num_col` to reach `num_col_classifier`
(allows large headers not to be split up and thus better ordered)
2025-10-25 13:35:56 +02:00
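Note on 66a0e55e49: the significance test described above might look roughly like the sketch below. Only the 22% figure comes from the commit message; the helper name `is_significant_slice` and its signature are hypothetical, and how the result is passed on to `find_num_col` is not shown.

```python
SIGNIFICANT_FRACTION = 0.22  # threshold quoted in the commit message


def is_significant_slice(top, bot, page_height):
    """Return True if the y slice top:bot covers at least 22% of the page.

    Hypothetical helper: a caller would only force the column count
    towards `num_col_classifier` when this returns True.
    """
    return (bot - top) >= SIGNIFICANT_FRACTION * page_height


# A 200 px header band on a 1000 px page is not significant.
print(is_significant_slice(100, 300, 1000))
```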
Robert Sachunsky
6fbb5f8a12
return_boxes_of_images_by_order_of_reading_new: simplify
...
- array instead of list operations
- add better plotting (but commented out)
- add more debug printing (but commented out)
- add more inline comments for documentation
- rename identifiers to make them more readable:
- `cy_hor_diff` → `y_max_hor_some` (because the ymax gets passed)
- `lines` → `seps`
- `y_type_2` → `y_mid`
- `y_diff_type_2` → `y_max`
- `y_lines_by_order` → `y_mid_by_order`
- `y_lines_without_mother` → `y_mid_without_mother`
- `y_lines_with_child_without_mother` → `y_mid_with_child_without_mother`
- `y_column` → `y_mid_column`
- `y_column_nc` → `y_mid_column_nc`
- `y_all_between_nm_wc` → `y_mid_between_nm_wc`
- `lines_so_close_to_top_separator` → `seps_too_close_to_top_separator`
- `y_in_cols` and `y_down` → `y_mid_next`
- use `pairwise()` `nc_top:nc_bot` instead of `i_c` indexing
2025-10-25 13:35:44 +02:00
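Note on 6fbb5f8a12: the switch from index-based iteration to `pairwise()` follows a common pattern, sketched below. The list `boundaries` and the function `spans` are made up for illustration; only the names `nc_top` and `nc_bot` appear in the commit message, and their exact meaning in the code base is assumed here.

```python
try:
    from itertools import pairwise  # available from Python 3.10
except ImportError:  # fallback for older interpreters
    def pairwise(iterable):
        items = list(iterable)
        return zip(items, items[1:])


def spans(boundaries):
    """Return consecutive (nc_top, nc_bot) pairs without manual indexing."""
    return [(nc_top, nc_bot) for nc_top, nc_bot in pairwise(boundaries)]


print(spans([0, 2, 3, 5]))
```

Compared with looping over an index `i_c` and reading `boundaries[i_c]` and `boundaries[i_c + 1]`, this avoids off-by-one errors at the end of the list.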
Robert Sachunsky
6cc5900943
find_num_col: add better plotting (but commented out)
2025-10-25 13:35:34 +02:00
Robert Sachunsky
5d15941b35
contours_in_same_horizon: simplify
...
- array instead of list operations
- return array of index pairs instead of list objects
2025-10-25 13:35:26 +02:00
Robert Sachunsky
acee4c1bfe
find_number_of_columns_in_document: simplify
2025-10-25 13:35:18 +02:00
Robert Sachunsky
b2a79cc6ed
return_x_start_end_mothers_childs_and_type_of_reading_order: fix+1
...
when calculating `reading_order_type`, upper limit on column range
(`x_end`) needs to be `+1` here as well
2025-10-25 13:35:12 +02:00
Robert Sachunsky
e2dfec75fb
return_x_start_end_mothers_childs_and_type_of_reading_order: simplify and document
...
- simplify
- rename identifiers to make them more readable:
- `y_sep` → `y_mid` (because the cy gets passed)
- `y_diff` → `y_max` (because the ymax gets passed)
- array instead of list operations
- add docstring and in-line comments
- return (zero-length) numpy array instead of empty list
2025-10-25 13:35:06 +02:00
Robert Sachunsky
0fc4b2535d
return_boxes_of_images_by_order_of_reading_new: fix no-mother case
...
- when handling lines without mother,
and biggest line already accounts for all columns,
but some are too close to the top and therefore must be removed,
avoid invalidating `biggest` index, causing `IndexError`
- remove try-catch (now unnecessary)
- array instead of list operations
2025-10-25 13:34:58 +02:00
Robert Sachunsky
7c3e418588
return_boxes_of_images_by_order_of_reading_new: simplify
...
- enumeration instead of indexing
- array instead of list operations
- add better plotting (but commented out)
2025-10-25 13:34:52 +02:00
Robert Sachunsky
cd35241e81
find_number_of_columns_in_document: split headings at top+baseline
...
regarding `splitter_y` result, for headings, instead of cutting right
through them via center line, add their toplines and baselines as if
they were horizontal separators
2025-10-25 13:34:35 +02:00
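Note on cd35241e81: the idea of splitting at the topline and baseline of a heading, rather than at its center line, can be sketched as follows. The function name `splitter_candidates` and the plain-tuple inputs are assumptions for illustration; the real `splitter_y` computation operates on contours and masks.

```python
def splitter_candidates(separator_ys, headings):
    """Collect y positions at which the page may be split.

    separator_ys: y positions of horizontal separators
    headings:     (y_min, y_max) extents of heading regions
    """
    ys = list(separator_ys)
    for y_min, y_max in headings:
        # Add topline and baseline instead of the center (y_min + y_max) / 2,
        # so that a split never cuts through the heading itself.
        ys.extend([y_min, y_max])
    return sorted(ys)


print(splitter_candidates([400], [(100, 160)]))
```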
Robert Sachunsky
5a0e4c3b0f
find_number_of_columns_in_document: improve splitter rule
...
extend horizontal separators to full img width if they do not overlap
any other regions
(only as regards the returned `splitter_y` result,
but without changing returned separators mask)
2025-10-20 17:41:50 +02:00
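Note on 5a0e4c3b0f: a rough sketch of the splitter rule, under the assumption that separators and regions are axis-aligned boxes given as `(x_min, x_max, y_min, y_max)`. The name `extend_separator` is hypothetical, and the commit itself works on label masks rather than boxes.

```python
def extend_separator(sep, regions, img_width):
    """Extend a horizontal separator to full image width if nothing is hit.

    Returns the full-width box when no region intersects the separator's
    y band, otherwise the separator unchanged.
    """
    _, _, y_min, y_max = sep
    for _, _, ry_min, ry_max in regions:
        # Any region sharing the separator's y band would be overlapped
        # by the full-width version.
        if ry_min < y_max and ry_max > y_min:
            return sep
    return (0, img_width, y_min, y_max)


print(extend_separator((100, 300, 50, 55), [(0, 400, 60, 200)], 400))
```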
Robert Sachunsky
542d38ab43
find_number_of_columns_in_document: simplify, rename line→seps
2025-10-20 17:41:49 +02:00
Robert Sachunsky
d3d599b010
order_of_regions: add better plotting (but commented out)
2025-10-20 17:41:47 +02:00
Robert Sachunsky
c43a825d1d
order_of_regions: filter out-of-image peaks
2025-10-20 17:41:47 +02:00
Robert Sachunsky
48761c3e12
find_num_col: simplify, add better plotting (but commented out)
2025-10-20 17:41:45 +02:00
Robert Sachunsky
184927fb54
find_num_col: re-sort peaks when cutting n-best num_col_classifier
2025-10-20 17:41:44 +02:00
Robert Sachunsky
086c1880ac
binarization: add option --overwrite, skip existing outputs
...
(also, simplify `run` and separate `run_single`)
2025-10-20 17:40:52 +02:00
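Note on 086c1880ac: skipping existing outputs unless `--overwrite` is given usually comes down to a check like the one below. This is a generic sketch, not the code from the commit; `should_process` is a hypothetical helper and the file name is made up.

```python
from pathlib import Path


def should_process(output_path, overwrite=False):
    """Return True if the output should be (re)written.

    Existing outputs are skipped unless `overwrite` is set.
    """
    return overwrite or not Path(output_path).exists()


print(should_process("does-not-exist-yet.png"))
```

Performing the check per input file, before any model is run on it, is what makes rerunning a partially finished batch cheap.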
kba
38c028c6b5
📦 v0.6.0
2025-10-17 10:36:30 +02:00
kba
ca8edb35e3
📝 changelog
2025-10-17 10:35:13 +02:00
kba
50e8b2c266
Merge branch 'integrate-training-from-sbb_pixelwise_segmentation'
2025-10-17 10:33:04 +02:00
kba
46d25647f7
📝 changelog
2025-10-17 10:32:15 +02:00
Robert Sachunsky
2ac01ecacc
join_polygons: try to catch rare case of MultiPolygon
2025-10-17 10:31:51 +02:00
kba
2e0fb64dcb
disable ruff check for training code for now
2025-10-16 21:29:37 +02:00
kba
76c13bcfd7
Merge branch 'integrate-training-from-sbb_pixelwise_segmentation' of https://github.com/qurator-spk/eynollah into integrate-training-from-sbb_pixelwise_segmentation
2025-10-16 20:50:24 +02:00
kba
af5abb77fd
Merge branch 'main' into integrate-training-from-sbb_pixelwise_segmentation
2025-10-16 20:50:16 +02:00
kba
d2f0a43088
📝 changelog
2025-10-16 20:46:49 +02:00
Konstantin Baierer
3bd3faef68
Merge pull request #193 from qurator-spk/training-installation
...
Training installation
2025-10-16 20:39:17 +02:00
kba
1e66c85222
Merge branch 'integrate-training-from-sbb_pixelwise_segmentation' into training-installation
2025-10-16 16:18:02 +02:00
kba
bd8c8bfeac
training: pin numpy to <1.24 as well
2025-10-16 16:15:31 +02:00
Robert Sachunsky
948c8c3441
join_polygons: try to catch rare case of MultiPolygon
2025-10-15 16:58:17 +02:00
kba
f485dd4181
📦 v0.6.0rc2
2025-10-14 16:10:50 +02:00
kba
c1f0158806
📝 changelog
2025-10-14 14:53:15 +02:00
kba
7daa0a1bd5
Merge branch 'fix-196' into prepare-v0.6.0rc2
2025-10-14 14:52:36 +02:00
kba
2febf53479
📝 changelog
2025-10-14 14:52:31 +02:00
Robert Sachunsky
8299e7009a
setup_models: avoid unnecessarily loading region_fl
2025-10-14 14:27:32 +02:00
Robert Sachunsky
e8b7212f36
polygon2contour: avoid uint for coords
...
(introduced in a433c736 to make consistent with
`filter_contours_area_of_image`, but actually
np.uint is prone to create overflows downstream)
2025-10-14 14:27:26 +02:00
kba
745cf3be48
XML encoding should be utf-8 not utf8
...
... and should use OCR-D's generateDS PAGE API consistently
2025-10-10 16:39:17 +02:00
kba
2056a8bdb9
📦 v0.6.0rc1
2025-10-10 16:32:47 +02:00
Robert Sachunsky
4e9a1618c3
layout: refactor model setup, allow loading custom versions
...
- simplify definition of (defaults for) model versions
- unify loading of loadable models (depending on mode)
- use `self.models` dict instead of `self.model_*` attributes
- add `model_versions` kwarg / `--model_version` CLI option
2025-10-10 03:18:09 +02:00
Robert Sachunsky
374818de11
📝 update changelog for 5725e4f
2025-10-09 23:11:05 +02:00
Robert Sachunsky
c4cb16c2a8
simplify
...
(`skip_layout_and_reading_order` is already an attr)
2025-10-09 23:05:50 +02:00
Robert Sachunsky
ecb53056f2
Merge branch 'main' of https://github.com/qurator-spk/eynollah into loky-with-shm-for-175-rebuilt
2025-10-09 22:54:11 +02:00
Robert Sachunsky
d96af425a7
Merge pull request #4 from bertsky/loky-with-shm-for-175-rebuilt-refactored
...
refactoring for 192: speedup and improvements
2025-10-09 22:18:53 +02:00
Robert Sachunsky
cab392601e
📝 update changelog
2025-10-09 20:14:11 +02:00
Robert Sachunsky
e1b56d97da
CI: lint with ruff
2025-10-09 20:14:11 +02:00