kba
51d2680d9c
wip
2025-10-27 11:44:59 +01:00
Robert Sachunsky
19b2c3fa42
reading order: improve handling of headings and horizontal seps
...
- drop connected components analysis to test overlaps between
horizontal separators and (horizontal) neighbours (introduced
in ab17a927)
- instead of converting headings to topline and baseline during
`find_number_of_columns_in_document` (introduced in 9f1595d7),
add them to the matrix unchanged, but mark as extra type
(besides horizontal and vertical separators)
- convert headings to toplines and baselines no earlier than in
`return_boxes_of_images_by_order_of_reading_new`
- for both headings and horizontal separators, if they already
span multiple columns, check if they would overlap (horizontal)
neighbours by looking at successively larger (left and right)
intervals of columns (and pick the largest elongation which
does not introduce any overlaps)
2025-10-25 13:36:35 +02:00
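The last point of 19b2c3fa42 (widening a heading or horizontal separator over neighbouring columns for as long as no overlap is introduced) could be sketched roughly as below. This is a simplified illustration and not the commit's code: the function name `widest_free_span`, the half-open column range and the per-column `blocked` flags are assumptions made for the example.

```python
def widest_free_span(start, end, blocked):
    """Widen the half-open column range [start, end) without hitting a neighbour.

    `blocked[c]` is True when column `c` holds a horizontal neighbour at the
    same height (hypothetical input; the real code works on image masks).
    """
    left = start
    # Look at successively larger intervals to the left.
    while left > 0 and not blocked[left - 1]:
        left -= 1
    right = end
    # Look at successively larger intervals to the right.
    while right < len(blocked) and not blocked[right]:
        right += 1
    return left, right


# A separator in column 2 of 5, with neighbours in the outermost columns.
span = widest_free_span(2, 3, [True, False, False, False, True])
```

Here the span grows by one column on each side and stops before the blocked outer columns.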
Robert Sachunsky
3367462d18
return_boxes_of_images_by_order_of_reading_new: change arg order
2025-10-25 13:36:24 +02:00
Robert Sachunsky
a2a9fe5117
delete_separator_around: simplify, eynollah: identifiers
...
- use array instead of list operations
- rename identifiers:
- `pixel` → `label`
- `line` → `sep`
2025-10-25 13:36:17 +02:00
Robert Sachunsky
3ebbc2d693
return_boxes_of_images_by_order_of_reading_new: indent
...
(by removing unnecessary conditional)
2025-10-25 13:36:06 +02:00
Robert Sachunsky
66a0e55e49
return_boxes_of_images_by_order_of_reading_new: avoid oversplits
...
when y slice (`top:bot`) is not a significant part of the page,
viz. less than 22% (as in `find_number_of_columns_in_document`),
avoid forcing `find_num_col` to reach `num_col_classifier`
(allows large headers not to be split up and thus better ordered)
2025-10-25 13:35:56 +02:00
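The rule described in 66a0e55e49 amounts to a size check on the vertical slice. A minimal sketch, assuming a hypothetical helper `slice_is_significant` and a non-strict comparison (the commit message only gives the 22% figure, not the exact comparison):

```python
def slice_is_significant(top, bot, page_height, min_fraction=0.22):
    """Return True when the y slice `top:bot` covers at least
    `min_fraction` of the page height (hypothetical helper)."""
    return (bot - top) >= min_fraction * page_height


# A 100 px header on a 1000 px page is below the 22% threshold, so the
# column count from the classifier would not be forced onto it.
small = slice_is_significant(0, 100, 1000)
large = slice_is_significant(0, 500, 1000)
```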
Robert Sachunsky
6fbb5f8a12
return_boxes_of_images_by_order_of_reading_new: simplify
...
- array instead of list operations
- add better plotting (but commented out)
- add more debug printing (but commented out)
- add more inline comments for documentation
- rename identifiers to make more readable:
- `cy_hor_diff` → `y_max_hor_some` (because the ymax gets passed)
- `lines` → `seps`
- `y_type_2` → `y_mid`
- `y_diff_type_2` → `y_max`
- `y_lines_by_order` → `y_mid_by_order`
- `y_lines_without_mother` → `y_mid_without_mother`
- `y_lines_with_child_without_mother` → `y_mid_with_child_without_mother`
- `y_column` → `y_mid_column`
- `y_column_nc` → `y_mid_column_nc`
- `y_all_between_nm_wc` → `y_mid_between_nm_wc`
- `lines_so_close_to_top_separator` → `seps_too_close_to_top_separator`
- `y_in_cols` and `y_down` → `y_mid_next`
- use `pairwise()` `nc_top:nc_bot` instead of `i_c` indexing
2025-10-25 13:35:44 +02:00
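The last item of 6fbb5f8a12 replaces index arithmetic with iteration over consecutive pairs. A generic illustration of that pattern follows; the `bounds` values are made up, and whether the project uses `itertools.pairwise` or its own helper is not stated in the log:

```python
try:
    from itertools import pairwise  # standard library since Python 3.10
except ImportError:
    def pairwise(iterable):
        # Fallback for older interpreters: consecutive overlapping pairs.
        items = list(iterable)
        return zip(items, items[1:])

# Hypothetical sorted boundaries.
bounds = [0, 120, 480, 900]

# Index-based version: needs an explicit counter and a length guard.
by_index = [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

# Pairwise version: each step yields one (start, stop) pair directly.
by_pairs = [(start, stop) for start, stop in pairwise(bounds)]
```

Both lists hold the same pairs; the pairwise form removes the off-by-one risk of the index form.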
Robert Sachunsky
6cc5900943
find_num_col: add better plotting (but commented out)
2025-10-25 13:35:34 +02:00
Robert Sachunsky
5d15941b35
contours_in_same_horizon: simplify
...
- array instead of list operations
- return array of index pairs instead of list objects
2025-10-25 13:35:26 +02:00
Robert Sachunsky
acee4c1bfe
find_number_of_columns_in_document: simplify
2025-10-25 13:35:18 +02:00
Robert Sachunsky
b2a79cc6ed
return_x_start_end_mothers_childs_and_type_of_reading_order: fix+1
...
when calculating `reading_order_type`, upper limit on column range
(`x_end`) needs to be `+1` here as well
2025-10-25 13:35:12 +02:00
Robert Sachunsky
e2dfec75fb
return_x_start_end_mothers_childs_and_type_of_reading_order:
...
simplify and document
- simplify
- rename identifiers to make readable:
- `y_sep` → `y_mid` (because the cy gets passed)
- `y_diff` → `y_max` (because the ymax gets passed)
- array instead of list operations
- add docstring and in-line comments
- return (zero-length) numpy array instead of empty list
2025-10-25 13:35:06 +02:00
Robert Sachunsky
0fc4b2535d
return_boxes_of_images_by_order_of_reading_new: fix no-mother case
...
- when handling lines without mother,
and biggest line already accounts for all columns,
but some are too close to the top and therefore must be removed,
avoid invalidating `biggest` index, causing `IndexError`
- remove try-catch (now unnecessary)
- array instead of list operations
2025-10-25 13:34:58 +02:00
Robert Sachunsky
7c3e418588
return_boxes_of_images_by_order_of_reading_new: simplify
...
- enumeration instead of indexing
- array instead of list operations
- add better plotting (but commented out)
2025-10-25 13:34:52 +02:00
Robert Sachunsky
cd35241e81
find_number_of_columns_in_document: split headings at top+baseline
...
regarding `splitter_y` result, for headings, instead of cutting right
through them via center line, add their toplines and baselines as if
they were horizontal separators
2025-10-25 13:34:35 +02:00
kba
ec1fd93dad
wip
2025-10-23 11:58:23 +02:00
kba
883546a6b8
eynollah models package
2025-10-22 17:05:40 +02:00
kba
04bc4a63d0
reorganize model_zoo
2025-10-22 16:04:48 +02:00
kba
d94285b3ea
rewrite model spec data structure
2025-10-22 13:07:35 +02:00
kba
146658f026
eynollah layout: fix trocr_processor model_zoo call
2025-10-22 10:48:26 +02:00
kba
4c8abfe19c
eynollah_ocr: actually replace the model calls
2025-10-22 10:48:26 +02:00
kba
1337461d47
adopt image_enhancer to the zoo
2025-10-21 19:24:55 +02:00
kba
f0c86672f8
adopt mb_ro_on_layout to the zoo
2025-10-21 17:55:08 +02:00
kba
bcffa2e503
adopt binarizer to the zoo
2025-10-21 17:53:24 +02:00
kba
a53d5fc452
update docs/makefile to point to v0.6.0 models
2025-10-21 13:15:57 +02:00
kba
c6b863b13f
typing and asserts
2025-10-21 12:05:27 +02:00
kba
44b75eb36f
cli: model -> model_basedir
2025-10-21 11:05:12 +02:00
kba
062f317d2e
Introduce model_zoo to Eynollah_ocr
2025-10-20 21:14:52 +02:00
kba
d609a532bf
organize imports mostly
2025-10-20 19:46:07 +02:00
kba
48d1198d24
move Eynollah_ocr to separate module
2025-10-20 19:15:31 +02:00
kba
a850ef39ea
factor model loading in Eynollah to EynollahModelZoo
2025-10-20 18:34:44 +02:00
Robert Sachunsky
5a0e4c3b0f
find_number_of_columns_in_document: improve splitter rule
...
extend horizontal separators to full img width if they do not overlap
any other regions
(only with regard to the returned `splitter_y` result,
without changing the returned separators mask)
2025-10-20 17:41:50 +02:00
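The splitter rule from 5a0e4c3b0f can be pictured as a check on the image band occupied by a separator. The sketch below is an assumption-laden illustration, not the project's code: `extend_if_free`, its arguments and the boolean `regions` mask are invented for the example.

```python
import numpy as np


def extend_if_free(regions, y0, y1, x0, x1):
    """Return the x range of a horizontal separator at rows y0:y1.

    If no other region pixel lies left or right of the separator within
    those rows, the range is widened to the full image width.
    """
    band = regions[y0:y1]
    free = not band[:, :x0].any() and not band[:, x1:].any()
    return (0, regions.shape[1]) if free else (x0, x1)


# 10x10 mask with one region pixel at row 6, column 8.
regions = np.zeros((10, 10), dtype=bool)
regions[6, 8] = True

free_case = extend_if_free(regions, 2, 3, 3, 6)     # rows 2:3 are empty
blocked_case = extend_if_free(regions, 6, 7, 3, 6)  # row 6 has a neighbour
```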
Robert Sachunsky
542d38ab43
find_number_of_columns_in_document: simplify, rename line→seps
2025-10-20 17:41:49 +02:00
Robert Sachunsky
d3d599b010
order_of_regions: add better plotting (but commented out)
2025-10-20 17:41:47 +02:00
Robert Sachunsky
c43a825d1d
order_of_regions: filter out-of-image peaks
2025-10-20 17:41:47 +02:00
Robert Sachunsky
48761c3e12
find_num_col: simplify, add better plotting (but commented out)
2025-10-20 17:41:45 +02:00
Robert Sachunsky
184927fb54
find_num_cols: re-sort peaks when cutting n-best num_col_classifier
2025-10-20 17:41:44 +02:00
Robert Sachunsky
086c1880ac
binarization: add option --overwrite, skip existing outputs
...
(also, simplify `run` and separate `run_single`)
2025-10-20 17:40:52 +02:00
kba
6c89888166
Refactor CLI for consistent logging and late imports
2025-10-17 17:47:59 +02:00
kba
557fb227f3
training/gt_gen_utils: fix type errors, comment out dead code
2025-10-17 14:21:05 +02:00
kba
af74890b2e
training/inference.py: add typing info, organize imports
2025-10-17 14:07:43 +02:00
kba
3a73ccca2e
training/models.py: make imports explicit
2025-10-17 13:45:44 +02:00
kba
38c028c6b5
📦 v0.6.0
2025-10-17 10:36:30 +02:00
kba
76c13bcfd7
Merge branch 'integrate-training-from-sbb_pixelwise_segmentation' of https://github.com/qurator-spk/eynollah into integrate-training-from-sbb_pixelwise_segmentation
2025-10-16 20:50:24 +02:00
kba
af5abb77fd
Merge branch 'main' into integrate-training-from-sbb_pixelwise_segmentation
2025-10-16 20:50:16 +02:00
Robert Sachunsky
948c8c3441
join_polygons: try to catch rare case of MultiPolygon
2025-10-15 16:58:17 +02:00
kba
f485dd4181
📦 v0.6.0rc2
2025-10-14 16:10:50 +02:00
kba
7daa0a1bd5
Merge branch 'fix-196' into prepare-v0.6.0rc2
2025-10-14 14:52:36 +02:00
Robert Sachunsky
8299e7009a
setup_models: avoid unnecessarily loading region_fl
2025-10-14 14:27:32 +02:00
Robert Sachunsky
e8b7212f36
polygon2contour: avoid uint for coords
...
(introduced in a433c736 to make consistent with
`filter_contours_area_of_image`, but actually
np.uint is prone to create overflows downstream)
2025-10-14 14:27:26 +02:00
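The overflow risk named in e8b7212f36 is easy to reproduce: subtracting unsigned integer arrays wraps around silently instead of going negative. A minimal demonstration, using `np.uint32` explicitly (an assumption for reproducibility, since the width of `np.uint` depends on platform and NumPy version) and made-up coordinates:

```python
import numpy as np

# Hypothetical contour coordinates and an offset to subtract.
coords = np.array([[5, 3], [10, 8]], dtype=np.uint32)
offset = np.array([7, 4], dtype=np.uint32)

# Unsigned arithmetic wraps around modulo 2**32 without raising an error.
wrapped = coords - offset

# Converting to a signed type first keeps the negative values intact.
signed = coords.astype(np.int64) - offset.astype(np.int64)
```

The first row of `wrapped` holds huge positive numbers, while `signed` holds the expected small negative ones.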