Commit graph

1351 commits

Author SHA1 Message Date
Robert Sachunsky
9b66867c21 training.models: re-use transformer builder code 2026-02-17 18:09:15 +01:00
Robert Sachunsky
daa084c367 training.models: re-use UNet decoder builder code 2026-02-17 18:09:15 +01:00
Robert Sachunsky
fcd10c3956 training.models: re-use RESNET50 builder (+weight init) code 2026-02-17 18:09:15 +01:00
Robert Sachunsky
4414f7b89b training.models.vit_resnet50_unet: re-use IMAGE_ORDERING 2026-02-17 14:18:32 +01:00
Robert Sachunsky
7888fa5968 training: remove data_gen in favor of tf.data pipelines
instead of looping over file pairs indefinitely, yielding
Numpy arrays: re-use `keras.utils.image_dataset_from_directory`
here as well, but with img/label generators zipped together

(thus, everything will already be loaded/prefetched on the GPU)
2026-02-17 12:44:45 +01:00
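Editor's note: the commit above replaces an endless Python generator with a `tf.data` pipeline built by zipping an image dataset and a label dataset. A minimal sketch of that zip/batch/prefetch pattern, using in-memory tensors as stand-ins for the `image_dataset_from_directory` outputs (the array shapes here are illustrative, not the project's):

```python
import numpy as np
import tensorflow as tf

# Illustrative stand-ins for the image/label datasets: in the real
# pipeline these would come from keras.utils.image_dataset_from_directory.
images = tf.data.Dataset.from_tensor_slices(
    np.zeros((8, 32, 32, 3), dtype=np.float32))
labels = tf.data.Dataset.from_tensor_slices(
    np.zeros((8, 32, 32, 1), dtype=np.float32))

# Zip the two datasets into (img, label) pairs, then batch and prefetch
# so the next batch is staged while the current step runs.
pairs = (tf.data.Dataset.zip((images, labels))
         .batch(4)
         .prefetch(tf.data.AUTOTUNE))

img_batch, label_batch = next(iter(pairs))
```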
Robert Sachunsky
83c2408192 training.utils.data_gen: avoid repeated array allocation 2026-02-17 12:44:45 +01:00
Robert Sachunsky
514a897dd5 training.train: assert n_epochs vs. index_start 2026-02-17 12:44:45 +01:00
Robert Sachunsky
37338049af training: use relative imports 2026-02-17 12:44:45 +01:00
Robert Sachunsky
7b7ef041ec training.models: use asymmetric zero padding instead of lambda layer 2026-02-17 12:44:45 +01:00
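Editor's note: Keras supports asymmetric padding directly via `ZeroPadding2D`, which is what makes the `Lambda`-wrapped `tf.pad` layer above unnecessary. A small sketch (the offsets are illustrative, not the model's actual ones):

```python
import tensorflow as tf

# ZeroPadding2D takes ((top, bottom), (left, right)) tuples, so an
# asymmetric pad needs no Lambda layer around tf.pad.
pad = tf.keras.layers.ZeroPadding2D(padding=((1, 0), (1, 0)))
out = pad(tf.zeros((1, 4, 4, 3)))  # pads one row on top, one col on the left
```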
Robert Sachunsky
ee4bffd81d training.train: simplify transformer cfg checks 2026-02-17 12:44:45 +01:00
Robert Sachunsky
53252a59c6 training.models: fix glitch introduced in 3a73ccca 2026-02-17 12:44:45 +01:00
Robert Sachunsky
ea285124ce fix Patches/PatchEncoder (make configurable again) 2026-02-17 12:44:45 +01:00
Robert Sachunsky
2492c257c6 ocrd-tool.json: re-instate light_version and textline_light dummies for backwards compatibility 2026-02-07 16:52:54 +01:00
Robert Sachunsky
bd282a594d training follow-up:
- use relative imports
- use tf.keras everywhere (and ensure v2)
- `weights_ensembling`:
  * use `Patches` and `PatchEncoder` from .models
  * drop TF1 stuff
  * make function / CLI more flexible (expect list of
    checkpoint dirs instead of single top-level directory)
- train for `classification`: delegate to `weights_ensembling.run_ensembling`
2026-02-07 16:34:55 +01:00
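Editor's note: one common form of weights ensembling over a list of checkpoint directories is averaging corresponding weight arrays across models. A hedged sketch of that idea (the helper name and signature are hypothetical; the actual `run_ensembling` may differ):

```python
import numpy as np

def average_weights(weight_sets):
    """Average corresponding weight arrays across several checkpoints.

    `weight_sets` is a list of per-model weight lists, as returned by
    Keras model.get_weights(). Hypothetical helper for illustration.
    """
    return [np.mean(arrays, axis=0) for arrays in zip(*weight_sets)]

# Two toy "checkpoints" with one weight matrix each:
w1 = [np.array([[0.0, 2.0]])]
w2 = [np.array([[2.0, 4.0]])]
avg = average_weights([w1, w2])
```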
Robert Sachunsky
27f43c175f Merge branch 'main' into ro-fixes and resolve conflicts…
major conflicts resolved manually:

- branches for non-`light` segmentation already removed in main
- Keras/TF setup and no TF1 sessions, esp. in new ModelZoo
- changes to binarizer and its CLI (`mode`, `overwrite`, `run_single()`)
- writer: `build...` w/ kwargs instead of positional
- training for segmentation/binarization/enhancement tasks:
  * drop unused `generate_data_from_folder()`
  * simplify `preprocess_imgs()`: turn `preprocess_img()`, `get_patches()`
    and `get_patches_num_scale_new()` into generators, only writing
    result files in the caller (top-level loop) instead of passing
    output directories and file counter
- training for new OCR task:
  * `train`: put keys into additional `config_params` where they belong,
    resp. (conditioned under existing keys), and w/ better documentation
  * `train`: add new keys as kwargs to `run()` to make usable
  * `utils`: instead of custom data loader `data_gen_ocr()`, re-use
    existing `preprocess_imgs()` (for cfg capture and top-level loop),
    but extended w/ new kwargs and calling new `preprocess_img_ocr()`;
    the latter as single-image generator (also much simplified)
  * `train`: use tf.data loader pipeline from that generator w/ standard
    mechanisms for batching, shuffling, prefetching etc.
  * `utils` and `train`: instead of `vectorize_label`, use `Dataset.padded_batch`
  * add TensorBoard callback and re-use our checkpoint callback
  * also use standard Keras top-level loop for training

still problematic (substantially unresolved):
- `Patches` now only w/ fixed implicit size
  (ignoring training config params)
- `PatchEncoder` now only w/ fixed implicit num patches and projection dim
  (ignoring training config params)
2026-02-07 14:05:56 +01:00
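Editor's note: the `Dataset.padded_batch` replacement for `vectorize_label` mentioned above pads each variable-length label sequence to the longest one in its batch. A minimal sketch with toy sequences:

```python
import tensorflow as tf

# Variable-length label sequences (e.g. encoded OCR transcriptions):
labels = tf.data.Dataset.from_generator(
    lambda: iter([[1, 2], [3, 4, 5], [6]]),
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32))

# padded_batch pads every sequence in a batch to the batch maximum,
# so no manual label vectorization step is needed.
batched = labels.padded_batch(3, padding_values=0)
batch = next(iter(batched))
```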
Robert Sachunsky
6944d31617 modify manual RO preference…
in `return_boxes_of_images_by_order_of_reading_new`,
when the next multicol separator ends in the same column,
do not recurse into subspan if the next starts earlier
(but continue with top span to the right first)
2026-02-05 17:58:32 +01:00
Robert Sachunsky
d047327a1f Merge pull request #5 from bertsky/ro-fixes-update-deps
update deps, refactor training
2026-02-05 17:36:50 +01:00
Robert Sachunsky
0d3a8eacba improve/update docs/train.md 2026-02-05 17:12:48 +01:00
Robert Sachunsky
b1633dfc7c training.generate_gt: for RO, skip files if regionRefs are missing 2026-02-05 17:12:48 +01:00
Robert Sachunsky
5d0c26b629 training.train: use std Keras data loader for classification
(much more efficient, works with std F1 metric)
2026-02-05 17:12:48 +01:00
Robert Sachunsky
f03124f747 training.train: simplify+fix classification data loaders…
- unify `generate_data_from_folder_training` w/ `..._evaluation`
- instead of recreating array after every batch, just zero out
- cast image results to uint8 instead of uint16
- cast categorical results to float instead of int
2026-02-05 17:12:48 +01:00
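Editor's note: "instead of recreating array after every batch, just zero out" refers to reusing one preallocated batch buffer. A minimal sketch of the pattern (shapes and values are illustrative):

```python
import numpy as np

batch_size, h, w, c = 4, 8, 8, 3
# Allocate the batch buffer once, with the cheaper uint8 dtype...
batch = np.zeros((batch_size, h, w, c), dtype=np.uint8)

for _ in range(3):           # e.g. one iteration per batch
    # ...and zero it in place between batches instead of calling
    # np.zeros() again, avoiding a fresh allocation every time.
    batch[:] = 0
    batch[0, 0, 0, 0] = 255  # fill with (dummy) image data
```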
Robert Sachunsky
82d649061a training.train: fix F1 metric score setup 2026-02-05 17:12:48 +01:00
Robert Sachunsky
5c7801a1d6 training.train: simplify config args for model builder 2026-02-05 17:12:48 +01:00
Robert Sachunsky
4a65ee0c67 training.train: more config dependencies…
- make more config_params keys dependent on each other
- re-order accordingly
- in main, initialise them (as kwarg), so sacred actually
  allows overriding them by named config file
2026-02-05 11:53:19 +01:00
Robert Sachunsky
7562317da5 training: fix+simplify load_model logic for continue_training
- add missing combination `transformer` (w/ patch encoder and
  `weighted_loss`)
- add assertion to prevent wrong loss type being configured
2026-02-04 17:35:38 +01:00
Robert Sachunsky
1581094141 training: extend index_start to tasks classification and RO 2026-02-04 17:35:12 +01:00
Robert Sachunsky
e85003db4a training: re-instate index_start, reflect cfg dependency
- `index_start`: re-introduce cfg key, pass to Keras `Model.fit`
  as `initial_epoch`
- make config keys `index_start` and `dir_of_start_model` dependent
  on `continue_training`
- improve description
2026-02-04 17:32:24 +01:00
kba
586077fbcd 📦 v0.7.0 2026-01-30 16:40:55 +01:00
kba
4ade0f788f 📝 changelog 2026-01-29 17:33:35 +01:00
kba
f13560726e Merge remote-tracking branch 'origin/adding-cnn-rnn-training-script' into 2026-01-29-training
# Conflicts:
#	src/eynollah/training/inference.py
2026-01-29 17:32:08 +01:00
Robert Sachunsky
25153ad307 training: add IoU metric 2026-01-29 12:20:42 +01:00
Robert Sachunsky
d1e8a02fd4 training: fix epoch size calculation 2026-01-29 12:20:42 +01:00
Robert Sachunsky
29a0f19cee training: simplify image preprocessing…
- `utils.provide_patches`: split up loop into
  * `utils.preprocess_img` (single img function)
  * `utils.preprocess_imgs` (top-level loop)
- capture exceptions for all cases (not just some)
  at top level and with informative logging
- avoid repeating / delegating config keys in several
  places: only as kwargs to `preprocess_img()`
- read files into memory only once, then re-use
- improve readability (avoiding long lines, repeated code)
2026-01-29 12:20:42 +01:00
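Editor's note: the split described above, a per-image function wrapped by a top-level loop that owns all error handling, can be sketched as follows (the function names mirror the commit, the bodies are illustrative only):

```python
import logging

def preprocess_img(img):
    # per-image work only; no output paths or file counters here
    return img * 2

def preprocess_imgs(imgs):
    # top-level loop: catches failures for every case, with context,
    # so callers see informative logging instead of a crash
    for name, img in imgs:
        try:
            yield name, preprocess_img(img)
        except Exception:
            logging.exception("preprocessing failed for %s", name)

results = dict(preprocess_imgs([("a", 1), ("b", 2)]))
```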
kba
87190f8997 Merge branch 'adding-cnn-rnn-training-script-rfct' into 2026-01-29-training
# Conflicts:
#	src/eynollah/training/models.py
2026-01-29 10:27:36 +01:00
kba
a76de1e182 Merge branch 'adding-cnn-rnn-training-script' into 2026-01-29-training 2026-01-29 10:26:34 +01:00
kba
ef3cf02877 Merge branch 'ruff-training' into 2026-01-29-training 2026-01-29 10:26:14 +01:00
Robert Sachunsky
e69b35b49c training.train.config_params: re-organise to reflect dependencies
- re-order keys belonging together logically
- make keys dependent on each other
2026-01-29 03:01:57 +01:00
Robert Sachunsky
0372fd7a1e training.gt_gen_utils: fix+simplify cropping…
when parsing `PrintSpace` or `Border` from PAGE-XML,
- use `lxml` XPath instead of nested loops
- convert points to polygons directly
  (instead of painting on canvas and retrieving contours)
- pass result bbox in slice notation
  (instead of xywh)
2026-01-29 03:01:57 +01:00
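Editor's note: a hedged sketch of the `lxml` XPath + slice-notation idea above, on a stripped-down PAGE-XML fragment (the fragment and variable names are illustrative, not the project's code):

```python
from lxml import etree

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}
xml = b"""<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page><PrintSpace><Coords points="10,20 110,20 110,220 10,220"/></PrintSpace></Page>
</PcGts>"""

tree = etree.fromstring(xml)
# One XPath query instead of nested loops over the element tree:
points = tree.xpath("//pc:PrintSpace/pc:Coords/@points", namespaces=NS)[0]
# Convert the points attribute directly to a polygon (no canvas painting):
poly = [tuple(map(int, p.split(","))) for p in points.split()]

# Bounding box in slice notation (rows, cols) instead of x/y/w/h,
# ready for numpy-style cropping: img[bbox]
xs, ys = zip(*poly)
bbox = (slice(min(ys), max(ys)), slice(min(xs), max(xs)))
```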
Robert Sachunsky
acda9c84ee training.gt_gen_utils: improve XML→img path mapping…
when matching files in `dir_images` by XML path name stem,
 * use `dict` instead of `list` to assign reliably
 * filter out `.xml` files (so input directories can be mixed)
 * show informative warnings for files which cannot be matched
2026-01-29 03:01:57 +01:00
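Editor's note: the `dict`-by-stem matching described above can be sketched like this (the helper name is hypothetical; only the technique comes from the commit):

```python
from pathlib import Path

def map_xml_to_images(xml_files, image_files):
    """Match PAGE-XML files to images by path stem (illustrative sketch)."""
    # A dict keyed by stem assigns reliably, instead of relying on
    # parallel list ordering; filtering .xml tolerates mixed directories.
    images_by_stem = {Path(p).stem: p for p in image_files
                      if not p.endswith(".xml")}
    mapping, unmatched = {}, []
    for xml in xml_files:
        stem = Path(xml).stem
        if stem in images_by_stem:
            mapping[xml] = images_by_stem[stem]
        else:
            unmatched.append(xml)  # caller can emit informative warnings
    return mapping, unmatched

mapping, unmatched = map_xml_to_images(
    ["a.xml", "c.xml"], ["a.png", "b.png", "notes.xml"])
```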
Robert Sachunsky
eb92760f73 training: download pretrained RESNET weights if missing 2026-01-29 03:01:57 +01:00
Robert Sachunsky
6a81db934e improve docs/train.md 2026-01-29 03:01:57 +01:00
Robert Sachunsky
87d7ffbdd8 training: use proper Keras callbacks and top-level loop 2026-01-29 03:01:57 +01:00
vahidrezanezhad
f9695cd7be Merge branch 'adding-cnn-rnn-training-script' of https://github.com/qurator-spk/eynollah into adding-cnn-rnn-training-script 2026-01-28 11:52:36 +01:00
vahidrezanezhad
3500167870 weights ensembling for tensorflow models is integrated 2026-01-28 11:52:12 +01:00
vahidrezanezhad
33f6a231bc fix: prevent crash when printspace is missing in xmls used for label generation 2026-01-26 17:30:26 +01:00
vahidrezanezhad
6ae244bf9b Fix filename stem extraction using binarization. Restore the CNN-RNN model to its previous version, as setting channels_last alone was insufficient for running on both CPU and GPU. Prevent errors caused by null values in image shape elements. 2026-01-26 15:04:47 +01:00
vahidrezanezhad
30f39e7383 mapregion is added to labels 2026-01-26 13:56:34 +01:00
vahidrezanezhad
c8240905a8 Fix label generation by selecting largest contour when erosion splits shapes 2026-01-26 13:36:24 +01:00
Robert Sachunsky
3c3effcfda drop TF1 vernacular, relax TF/Keras and Torch requirements…
- do not restrict TF version, but depend on tf-keras and
  set `TF_USE_LEGACY_KERAS=1` to avoid Keras 3 behaviour
- relax Numpy version requirement up to v2
- relax Torch version requirement
- drop TF1 session management code
- drop TF1 config in favour of TF2 config code for memory growth
- training.*: also simplify and limit line length
- training.train: always train with TensorBoard callback
2026-01-20 11:34:02 +01:00
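Editor's note: the two setup points above, opting out of Keras 3 via `tf-keras` and replacing TF1 `ConfigProto` session options with the TF2 memory-growth API, look roughly like this (a sketch, not the project's exact startup code):

```python
import os
# Must be set before TensorFlow/Keras is imported, and requires the
# tf-keras package to be installed:
os.environ["TF_USE_LEGACY_KERAS"] = "1"

import tensorflow as tf

# TF2 replacement for the old TF1 session/ConfigProto gpu_options:
# grow GPU memory on demand instead of grabbing it all up front.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```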
Robert Sachunsky
e2754da4f5 adapt to Numpy 1.25 changes…
(esp. `np.array(...)` now not allowed on ragged arrays unless
 `dtype=object`, but then coercing sub-arrays to `object` as well)
2026-01-20 04:04:07 +01:00