eynollah

mirror of https://github.com/qurator-spk/eynollah.git synced 2026-07-14 07:39:15 +02:00

Author	SHA1	Message	Date
Robert Sachunsky	4abc2ff572	rewrite/simplify manual reading order using recursive algorithm - rename `return_x_start_end_mothers_childs_and_type_of_reading_order` → `return_multicol_separators_x_start_end`, and drop all the analysis pertaining to mother/child relationships and full-span separators, also drop the separator unification rules; instead of the latter, try to combine neighbouring separators more generally: join column spans iff there is nothing in between (which also necessitates passing the region mask), and keep only one of every such redundant pair; add the top (of each page part) as full-span separator up front, and return separators already ordered by y - `return_boxes_of_images_by_order_of_reading_new`: - also pass regions with separators, so they do not have to be reconstructed from the separator coordinates, and also contain images and other non-text region types, when trying to elongate separators to maximize their span (without introducing overlaps) - determine connected components of the region mask, i.e. labels and their respective bboxes, in order to 1. gain additional multi-column separators, if possible 2. avoid cutting through regions which do cross column boundaries later on - whenever adding a new bbox, first look up the label map to see if there are any multi-column regions extending to the right of the current column; if there are, then advance not just one column to the right, but as many as necessary to avoid cutting through these regions - new core algorithm: iterate separators sorted by y and then column by column, but whenever the next separator ends in the same column as the current one or even further left, recurse (i.e. finish that span first before continuing with the top iteration)	2025-11-14 13:14:53 +01:00
Robert Sachunsky	95f76081d1	rename some more identifiers: - `lines` → `seps` (to distinguish from textlines) - `text_regions_p_1_n` → `text_regions_p_d` (because all other deskewed variables are called like this) - `pixel` → `label`	2025-11-14 13:13:50 +01:00
Robert Sachunsky	1a76ce177d	do_order_of_regions: round contour centers (so we can be sure they do not fall through the "pixel cracks": bboxes are delimited by integers, and we do not want to assign contours between boxes)	2025-11-14 13:08:10 +01:00
Robert Sachunsky	19b2c3fa42	reading order: improve handling of headings and horizontal seps - drop connected components analysis to test overlaps between horizontal separators and (horizontal) neighbours (introduced in ab17a927) - instead of converting headings to topline and baseline during `find_number_of_columns_in_document` (introduced in 9f1595d7), add them to the matrix unchanged, but mark as extra type (besides horizontal and vertical separators) - convert headings to toplines and baselines no earlier than in `return_boxes_of_images_by_order_of_reading_new` - for both headings and horizontal separators, if they already span multiple columns, check if they would overlap (horizontal) neighbours by looking at successively larger (left and right) intervals of columns (and pick the largest elongation which does not introduce any overlaps)	2025-10-25 13:36:35 +02:00
Robert Sachunsky	3367462d18	`return_boxes_of_images_by_order_of_reading_new`: change arg order	2025-10-25 13:36:24 +02:00
Robert Sachunsky	a2a9fe5117	`delete_separator_around`: simplify, eynollah: identifiers - use array instead of list operations - rename identifiers: - `pixel` → `label` - `line` → `sep`	2025-10-25 13:36:17 +02:00
Robert Sachunsky	3ebbc2d693	`return_boxes_of_images_by_order_of_reading_new`: indent (by removing unnecessary conditional)	2025-10-25 13:36:06 +02:00
Robert Sachunsky	66a0e55e49	`return_boxes_of_images_by_order_of_reading_new`: avoid oversplits when y slice (`top:bot`) is not a significant part of the page, viz. less than 22% (as in `find_number_of_columns_in_document`), avoid forcing `find_num_col` to reach `num_col_classifier` (allows large headers not to be split up and thus better ordered)	2025-10-25 13:35:56 +02:00
Robert Sachunsky	6fbb5f8a12	`return_boxes_of_images_by_order_of_reading_new`: simplify - array instead of list operations - add better plotting (but commented out) - add more debug printing (but commented out) - add more inline comments for documentation - rename identifiers to make more readable: - `cy_hor_diff` → `y_max_hor_some` (because the ymax gets passed) - `lines` → `seps` - `y_type_2` → `y_mid` - `y_diff_type_2` → `y_max` - `y_lines_by_order` → `y_mid_by_order` - `y_lines_without_mother` → `y_mid_without_mother` - `y_lines_with_child_without_mother` → `y_mid_with_child_without_mother` - `y_column` → `y_mid_column` - `y_column_nc` → `y_mid_column_nc` - `y_all_between_nm_wc` → `y_mid_between_nm_wc` - `lines_so_close_to_top_separator` → `seps_too_close_to_top_separator` - `y_in_cols` and `y_down` → `y_mid_next` - use `pairwise()` `nc_top:nc_bot` instead of `i_c` indexing	2025-10-25 13:35:44 +02:00
Robert Sachunsky	6cc5900943	`find_num_col`: add better plotting (but commented out)	2025-10-25 13:35:34 +02:00
Robert Sachunsky	5d15941b35	`contours_in_same_horizon`: simplify - array instead of list operations - return array of index pairs instead of list objects	2025-10-25 13:35:26 +02:00
Robert Sachunsky	acee4c1bfe	`find_number_of_columns_in_document`: simplify	2025-10-25 13:35:18 +02:00
Robert Sachunsky	b2a79cc6ed	`return_x_start_end_mothers_childs_and_type_of_reading_order`: fix+1 when calculating `reading_order_type`, upper limit on column range (`x_end`) needs to be `+1` here as well	2025-10-25 13:35:12 +02:00
Robert Sachunsky	e2dfec75fb	`return_x_start_end_mothers_childs_and_type_of_reading_order`: simplify and document - simplify - rename identifiers to make readable: - `y_sep` → `y_mid` (because the cy gets passed) - `y_diff` → `y_max` (because the ymax gets passed) - array instead of list operations - add docstring and in-line comments - return (zero-length) numpy array instead of empty list	2025-10-25 13:35:06 +02:00
Robert Sachunsky	0fc4b2535d	`return_boxes_of_images_by_order_of_reading_new`: fix no-mother case - when handling lines without mother, and biggest line already accounts for all columns, but some are too close to the top and therefore must be removed, avoid invalidating `biggest` index, causing `IndexError` - remove try-catch (now unnecessary) - array instead of list operations	2025-10-25 13:34:58 +02:00
Robert Sachunsky	7c3e418588	`return_boxes_of_images_by_order_of_reading_new`: simplify - enumeration instead of indexing - array instead of list operations - add better plotting (but commented out)	2025-10-25 13:34:52 +02:00
Robert Sachunsky	cd35241e81	`find_number_of_columns_in_document`: split headings at top+baseline regarding `splitter_y` result, for headings, instead of cutting right through them via center line, add their toplines and baselines as if they were horizontal separators	2025-10-25 13:34:35 +02:00
Robert Sachunsky	5a0e4c3b0f	`find_number_of_columns_in_document`: improve splitter rule extend horizontal separators to full img width if they do not overlap any other regions (only as regards to returned `splitter_y` result, but without changing returned separators mask)	2025-10-20 17:41:50 +02:00
Robert Sachunsky	542d38ab43	`find_number_of_columns_in_document`: simplify, rename `line`→`seps`	2025-10-20 17:41:49 +02:00
Robert Sachunsky	d3d599b010	`order_of_regions`: add better plotting (but commented out)	2025-10-20 17:41:47 +02:00
Robert Sachunsky	c43a825d1d	`order_of_regions`: filter out-of-image peaks	2025-10-20 17:41:47 +02:00
Robert Sachunsky	48761c3e12	`find_num_col`: simplify, add better plotting (but commented out)	2025-10-20 17:41:45 +02:00
Robert Sachunsky	184927fb54	`find_num_cols`: re-sort peaks when cutting n-best `num_col_classifier`	2025-10-20 17:41:44 +02:00
Robert Sachunsky	086c1880ac	binarization: add option `--overwrite`, skip existing outputs (also, simplify `run` and separate `run_single`)	2025-10-20 17:40:52 +02:00
kba	38c028c6b5	📦 v0.6.0	2025-10-17 10:36:30 +02:00
kba	ca8edb35e3	📝 changelog	2025-10-17 10:35:13 +02:00
kba	50e8b2c266	Merge branch 'integrate-training-from-sbb_pixelwise_segmentation'	2025-10-17 10:33:04 +02:00
kba	46d25647f7	📝 changelog	2025-10-17 10:32:15 +02:00
Robert Sachunsky	2ac01ecacc	join_polygons: try to catch rare case of MultiPolygon	2025-10-17 10:31:51 +02:00
kba	2e0fb64dcb	disable ruff check for training code for now	2025-10-16 21:29:37 +02:00
kba	76c13bcfd7	Merge branch 'integrate-training-from-sbb_pixelwise_segmentation' of https://github.com/qurator-spk/eynollah into integrate-training-from-sbb_pixelwise_segmentation	2025-10-16 20:50:24 +02:00
kba	af5abb77fd	Merge branch 'main' into integrate-training-from-sbb_pixelwise_segmentation	2025-10-16 20:50:16 +02:00
kba	d2f0a43088	📝 changelog	2025-10-16 20:46:49 +02:00
Konstantin Baierer	3bd3faef68	Merge pull request #193 from qurator-spk/training-installation Training installation	2025-10-16 20:39:17 +02:00
kba	1e66c85222	Merge branch 'integrate-training-from-sbb_pixelwise_segmentation' into training-installation	2025-10-16 16:18:02 +02:00
kba	bd8c8bfeac	training: pin numpy to <1.24 as well	2025-10-16 16:15:31 +02:00
Robert Sachunsky	948c8c3441	join_polygons: try to catch rare case of MultiPolygon	2025-10-15 16:58:17 +02:00
kba	f485dd4181	📦 v0.6.0rc2	2025-10-14 16:10:50 +02:00
kba	c1f0158806	📝 changelog	2025-10-14 14:53:15 +02:00
kba	7daa0a1bd5	Merge branch 'fix-196' into prepare-v0.6.0rc2	2025-10-14 14:52:36 +02:00
kba	2febf53479	📝 changelog	2025-10-14 14:52:31 +02:00
Robert Sachunsky	8299e7009a	`setup_models`: avoid unnecessarily loading `region_fl`	2025-10-14 14:27:32 +02:00
Robert Sachunsky	e8b7212f36	`polygon2contour`: avoid uint for coords (introduced in `a433c736` to make consistent with `filter_contours_area_of_image`, but actually np.uint is prone to create overflows downstream)	2025-10-14 14:27:26 +02:00
kba	745cf3be48	XML encoding should be utf-8 not utf8 ... and should use OCR-D's generateDS PAGE API consistently	2025-10-10 16:39:17 +02:00
kba	2056a8bdb9	📦 v0.6.0rc1	2025-10-10 16:32:47 +02:00
Robert Sachunsky	4e9a1618c3	layout: refactor model setup, allow loading custom versions - simplify definition of (defaults for) model versions - unify loading of loadable models (depending on mode) - use `self.models` dict instead of `self.model_*` attributes - add `model_versions` kwarg / `--model_version` CLI option	2025-10-10 03:18:09 +02:00
Robert Sachunsky	374818de11	📝 update changelog for `5725e4f`	2025-10-09 23:11:05 +02:00
Robert Sachunsky	c4cb16c2a8	simplify (`skip_layout_and_reading_order` is already an attr)	2025-10-09 23:05:50 +02:00
Robert Sachunsky	ecb53056f2	Merge branch 'main' of https://github.com/qurator-spk/eynollah into loky-with-shm-for-175-rebuilt	2025-10-09 22:54:11 +02:00
Robert Sachunsky	d96af425a7	Merge pull request #4 from bertsky/loky-with-shm-for-175-rebuilt-refactored refactoring for 192: speedup and improvements	2025-10-09 22:18:53 +02:00

1162 commits