eynollah

mirror of https://github.com/qurator-spk/eynollah.git synced 2026-08-03 09:22:32 +02:00

Author	SHA1	Message	Date
Robert Sachunsky	f03124f747	training.train: simplify+fix classification data loaders… - unify `generate_data_from_folder_training` w/ `..._evaluation` - instead of recreating array after every batch, just zero out - cast image results to uint8 instead of uint16 - cast categorical results to float instead of int	2026-02-05 17:12:48 +01:00
Robert Sachunsky	82d649061a	training.train: fix F1 metric score setup	2026-02-05 17:12:48 +01:00
Robert Sachunsky	5c7801a1d6	training.train: simplify config args for model builder	2026-02-05 17:12:48 +01:00
Robert Sachunsky	4a65ee0c67	training.train: more config dependencies… - make more config_params keys dependent on each other - re-order accordingly - in main, initialise them (as kwarg), so sacred actually allows overriding them by named config file	2026-02-05 11:53:19 +01:00
Robert Sachunsky	7562317da5	training: fix+simplify `load_model` logic for `continue_training` - add missing combination `transformer` (w/ patch encoder and `weighted_loss`) - add assertion to prevent wrong loss type being configured	2026-02-04 17:35:38 +01:00
Robert Sachunsky	1581094141	training: extend `index_start` to tasks classification and RO	2026-02-04 17:35:12 +01:00
Robert Sachunsky	e85003db4a	training: re-instate `index_start`, reflect cfg dependency - `index_start`: re-introduce cfg key, pass to Keras `Model.fit` as `initial_epoch` - make config keys `index_start` and `dir_of_start_model` dependent on `continue_training` - improve description	2026-02-04 17:32:24 +01:00
Robert Sachunsky	25153ad307	training: add IoU metric	2026-01-29 12:20:42 +01:00
Robert Sachunsky	d1e8a02fd4	training: fix epoch size calculation	2026-01-29 12:20:42 +01:00
Robert Sachunsky	29a0f19cee	training: simplify image preprocessing… - `utils.provide_patches`: split up loop into * `utils.preprocess_img` (single img function) * `utils.preprocess_imgs` (top-level loop) - capture exceptions for all cases (not just some) at top level and with informative logging - avoid repeating / delegating config keys in several places: only as kwargs to `preprocess_img()` - read files into memory only once, then re-use - improve readability (avoiding long lines, repeated code)	2026-01-29 12:20:42 +01:00
Robert Sachunsky	e69b35b49c	training.train.config_params: re-organise to reflect dependencies - re-order keys belonging together logically - make keys dependent on each other	2026-01-29 03:01:57 +01:00
Robert Sachunsky	0372fd7a1e	training.gt_gen_utils: fix+simplify cropping… when parsing `PrintSpace` or `Border` from PAGE-XML, - use `lxml` XPath instead of nested loops - convert points to polygons directly (instead of painting on canvas and retrieving contours) - pass result bbox in slice notation (instead of xywh)	2026-01-29 03:01:57 +01:00
Robert Sachunsky	acda9c84ee	training.gt_gen_utils: improve XML→img path mapping… when matching files in `dir_images` by XML path name stem, * use `dict` instead of `list` to assign reliably * filter out `.xml` files (so input directories can be mixed) * show informative warnings for files which cannot be matched	2026-01-29 03:01:57 +01:00
Robert Sachunsky	eb92760f73	training: download pretrained RESNET weights if missing	2026-01-29 03:01:57 +01:00
Robert Sachunsky	6a81db934e	improve docs/train.md	2026-01-29 03:01:57 +01:00
Robert Sachunsky	87d7ffbdd8	training: use proper Keras callbacks and top-level loop	2026-01-29 03:01:57 +01:00
Robert Sachunsky	3c3effcfda	drop TF1 vernacular, relax TF/Keras and Torch requirements… - do not restrict TF version, but depend on tf-keras and set `TF_USE_LEGACY_KERAS=1` to avoid Keras 3 behaviour - relax Numpy version requirement up to v2 - relax Torch version requirement - drop TF1 session management code - drop TF1 config in favour of TF2 config code for memory growth - training.*: also simplify and limit line length - training.train: always train with TensorBoard callback	2026-01-20 11:34:02 +01:00
Robert Sachunsky	e2754da4f5	adapt to Numpy 1.25 changes… (esp. `np.array(...)` now not allowed on ragged arrays unless `dtype=object`, but then coercing sub-arrays to `object` as well)	2026-01-20 04:04:07 +01:00
Robert Sachunsky	9fdae72e96	utils_ocr.return_textline_contour: gen cv2-like contours (w/ ndim=3, as in all other places)	2025-12-03 03:04:46 +01:00
Robert Sachunsky	ad8f8167c2	separate_lines/_vertical: gen cv2-like contours (w/ ndim=3, as in all other places)	2025-12-03 00:58:26 +01:00
Robert Sachunsky	43a95842bd	writer: also ensure validity after scaling	2025-12-02 16:35:32 +01:00
Robert Sachunsky	56e73bf72f	deskewing: add a 2nd stage for precision after selecting the optimum angle on the original search range, narrow down around in the vicinity with half the range (adding computational costs, but gaining precision)	2025-11-28 18:27:58 +01:00
Robert Sachunsky	adcea47bc0	return_boxes_of_images_by_order_of_reading_new: always erode when passing the text region mask, do not apply erosion only if there are more than 2 columns, but iff `not erosion_hurts` (consistent with `find_num_col`'s expectations and making it as easy to find the column gaps on 1 and 2-column pages as on multi-column pages)	2025-11-28 18:23:59 +01:00
Robert Sachunsky	5a3de3b42d	column detection: improve, aided by vseps whenever possible - `find_number_of_columns_in_document`: retain vertical separators and pass to `find_num_col` for each vertical split - `return_boxes_of_images_by_order_of_reading_new`: reconstruct the vertical separators from the segmentation mask and the separator bboxes; pass it on to `find_num_col` everywhere - `return_boxes_of_images_by_order_of_reading_new`: no need to try-catch `find_num_col` anymore - `return_boxes_of_images_by_order_of_reading_new`: when a vertical split has too few columns, * do not raise but lower the threshold `multiplier` responsible for allowing gaps as column boundaries * do not pass the `num_col_classifier` (i.e. expected number of resulting columns) of the entire page to the iterative `find_num_col` for each existing column, but only the portion of that span	2025-11-28 18:14:24 +01:00
Robert Sachunsky	4dd40c542b	find_num_col: add optional criterion - sum of vertical separators when searching for gaps between text regions, consider the vertical separator mask (if given): add the vertical sum of vertical separators to the peak scores (making column detection more robust if still slightly skewed or partially obscured by multi-column regions, but fg seps are present)	2025-11-28 18:07:15 +01:00
Robert Sachunsky	84d10962f3	return_boxes_of_images_by_order_of_reading_new: improve - when searching for multi-col box makers, pick the right-most allowable column, not the left-most	2025-11-28 18:04:12 +01:00
Robert Sachunsky	5abf0c1097	return_boxes_of_images_by_order_of_reading_new: improve - when analysing regions spanning across columns, disregard tiny regions (smaller than half the median size) - if a region spans across columns just by a tiny fraction, and therefore is not good enough for a multi-col separator, then it should also not be good enough for a multi-col box maker	2025-11-28 17:58:44 +01:00
Robert Sachunsky	b71bb80e3a	return_boxes_of_images_by_order_of_reading_new: fix `4abc2ff5` (forgot to also flip `regions_with_separators` if right2left)	2025-11-28 17:57:10 +01:00
Robert Sachunsky	a527d7a10d	combine_hor_lines_and_delete_cross_points: improve - avoid unnecessary `fillPoly` (we already have the mask) - do not merge hseps if vseps interfere - remove old criterion (based on total length of hseps) - create new criterion (no x overlap and x close to each other) - rename identifiers: * `sum_dis` → `sum_xspan` * `diff_max_min_uniques` → `tot_xspan` * np.std / np.mean → `dev_xspan` - remove rule cutting around the center of crossing seps (which is unnecessary and creates small isolated seps at the center, unrelated to the actual crossing points) - create rule cutting hseps by vseps _prior_ to merging	2025-11-28 17:34:11 +01:00
Robert Sachunsky	5c12b6a851	combine_hor_lines_and_delete_cross_points: simplify and rename - `x_width_smaller_than_acolumn_width` → `avg_col_width` - `len_lines_bigger_than_x_width_smaller_than_acolumn_width` → `nseps_wider_than_than_avg_col_width` - `img_in_hor` → `img_p_in_hor` (analogous to vertical)	2025-11-28 17:27:12 +01:00
Robert Sachunsky	06cb9d1d31	combine_hor_lines_and_delete_cross_points: fix 1-off px bug when eroding the vertical separator mask (by slicing), avoid leaving 1px strips	2025-11-28 17:08:39 +01:00
Robert Sachunsky	38d91673b1	combine_hor_lines_and_delete_cross_points: get external contours instead of tree without looking at the actual hierarchy (to prevent retrieving holes as separators)	2025-11-28 16:50:08 +01:00
Robert Sachunsky	ee59a6809d	contours_in_same_horizon: fix `5d15941b`	2025-11-28 16:17:09 +01:00
Robert Sachunsky	e428e7ad78	ensure separators stay within image bounds	2025-11-16 16:35:18 +01:00
Robert Sachunsky	406288b1fe	fixup `72d059f3`: forgot to update other writer calls	2025-11-16 16:32:45 +01:00
Robert Sachunsky	028ed16921	adapt ocrd-sbb-binarize	2025-11-15 17:17:37 +01:00
Robert Sachunsky	49ab269e08	fix typos found by ruff	2025-11-15 15:49:51 +01:00
Robert Sachunsky	72d059f3c9	reading order: simplify assignment / counting - `do_order_of_regions`: simplify aggregating per-box orders for paragraphs and headings to overall order passed to `xml_reading_order`; no need for `order_and_id_of_texts`, no need to return `id_of_texts_tot` - `do_order_of_regions_with_model`: no need to return `region_ids` - writer: no need to pass `id_of_texts_tot` in `build_pagexml`	2025-11-15 14:34:12 +01:00
Robert Sachunsky	5a778003fd	contour matching for deskewed image: ensure matches for both sides	2025-11-15 14:32:22 +01:00
Robert Sachunsky	3c15c4f7d4	back to `rotate_image` instead of `rotation_image_new` for deskewing (because the latter does not preserve coordinates; it scales, even when resizing the image; this caused coordinate problems when matching deskewed contours)	2025-11-15 14:29:41 +01:00
Robert Sachunsky	4475183f08	improve rules governing column split - reduce `sigma` for smoothing of input to `find_peaks` (so we get deeper gaps between columns) - allow column boundaries closer to the margins (50 instead of 100 or 200 px, 170 instead of 370 px) - allow column boundaries closer to each other (300 instead of 400 px) - add a secondary `grenze` criterion for depth of gap (relative to lowest minimum, if that is smaller than the old criterion relative to lowest maximum) - for calls to `find_num_col` within parts of a page, do allow unbalanced column boundaries	2025-11-14 13:15:09 +01:00
Robert Sachunsky	4abc2ff572	rewrite/simplify manual reading order using recursive algorithm - rename `return_x_start_end_mothers_childs_and_type_of_reading_order` → `return_multicol_separators_x_start_end`, and drop all the analysis pertaining to mother/child relationships and full-span separators, also drop the separator unification rules; instead of the latter, try to combine neighbouring separators more generally: join column spans iff there is nothing in between (which also necessitates passing the region mask), and keep only one of every such redundant pair; add the top (of each page part) as full-span separator up front, and return separators already ordered by y - `return_boxes_of_images_by_order_of_reading_new`: - also pass regions with separators, so they do not have to be reconstructed from the separator coordinates, and also contain images and other non-text region types, when trying to elongate separators to maximize their span (without introducing overlaps) - determine connected components of the region mask, i.e. labels and their respective bboxes, in order to 1. gain additional multi-column separators, if possible 2. avoid cutting through regions which do cross column boundaries later on - whenever adding a new bbox, first look up the label map to see if there are any multi-column regions extending to the right of the current column; if there are, then advance not just one column to the right, but as many as necessary to avoid cutting through these regions - new core algorithm: iterate separators sorted by y and then column by column, but whenever the next separator ends in the same column as the current one or even further left, recurse (i.e. finish that span first before continuing with the top iteration)	2025-11-14 13:14:53 +01:00
Robert Sachunsky	95f76081d1	rename some more identifiers: - `lines` → `seps` (to distinguish from textlines) - `text_regions_p_1_n` → `text_regions_p_d` (because all other deskewed variables are called like this) - `pixel` → `label`	2025-11-14 13:13:50 +01:00
Robert Sachunsky	1a76ce177d	do_order_of_regions: round contour centers (so we can be sure they do not fall through the "pixel cracks": bboxes are delimited by integers, and we do not want to assign contours between boxes)	2025-11-14 13:08:10 +01:00
Robert Sachunsky	19b2c3fa42	reading order: improve handling of headings and horizontal seps - drop connected components analysis to test overlaps between horizontal separators and (horizontal) neighbours (introduced in ab17a927) - instead of converting headings to topline and baseline during `find_number_of_columns_in_document` (introduced in 9f1595d7), add them to the matrix unchanged, but mark as extra type (besides horizontal and vertical separators) - convert headings to toplines and baselines no earlier than in `return_boxes_of_images_by_order_of_reading_new` - for both headings and horizontal separators, if they already span multiple columns, check if they would overlap (horizontal) neighbours by looking at successively larger (left and right) intervals of columns (and pick the largest elongation which does not introduce any overlaps)	2025-10-25 13:36:35 +02:00
Robert Sachunsky	3367462d18	`return_boxes_of_images_by_order_of_reading_new`: change arg order	2025-10-25 13:36:24 +02:00
Robert Sachunsky	a2a9fe5117	`delete_separator_around`: simplify, eynollah: identifiers - use array instead of list operations - rename identifiers: - `pixel` → `label` - `line` → `sep`	2025-10-25 13:36:17 +02:00
Robert Sachunsky	3ebbc2d693	`return_boxes_of_images_by_order_of_reading_new`: indent (by removing unnecessary conditional)	2025-10-25 13:36:06 +02:00
Robert Sachunsky	66a0e55e49	`return_boxes_of_images_by_order_of_reading_new`: avoid oversplits when y slice (`top:bot`) is not a significant part of the page, viz. less than 22% (as in `find_number_of_columns_in_document`), avoid forcing `find_num_col` to reach `num_col_classifier` (allows large headers not to be split up and thus better ordered)	2025-10-25 13:35:56 +02:00
Robert Sachunsky	6fbb5f8a12	`return_boxes_of_images_by_order_of_reading_new`: simplify - array instead of list operations - add better plotting (but commented out) - add more debug printing (but commented out) - add more inline comments for documentation - rename identifiers to make more readable: - `cy_hor_diff` → `y_max_hor_some` (because the ymax gets passed) - `lines` → `seps` - `y_type_2` → `y_mid` - `y_diff_type_2` → `y_max` - `y_lines_by_order` → `y_mid_by_order` - `y_lines_without_mother` → `y_mid_without_mother` - `y_lines_with_child_without_mother` → `y_mid_with_child_without_mother` - `y_column` → `y_mid_column` - `y_column_nc` → `y_mid_column_nc` - `y_all_between_nm_wc` → `y_mid_between_nm_wc` - `lines_so_close_to_top_separator` → `seps_too_close_to_top_separator` - `y_in_cols` and `y_down` → `y_mid_next` - use `pairwise()` `nc_top:nc_bot` instead of `i_c` indexing	2025-10-25 13:35:44 +02:00

1203 commits