Commit graph

1379 commits

Author SHA1 Message Date
Robert Sachunsky
c1d8a72edc training: shuffle tf.data pipelines 2026-02-28 20:10:53 +01:00
Robert Sachunsky
1cff937e72 training: make data pipeline in 7888fa5 more efficient 2026-02-28 20:10:53 +01:00
Robert Sachunsky
f8dd5a328c training: make plotting in 18607e0f more efficient
- avoid control dependencies in model path
- store only every 3rd sample
2026-02-28 20:10:53 +01:00
Robert Sachunsky
2d5de8e595 training.models: use bilinear instead of nearest upsampling
(to benefit from CUDA optimization)
2026-02-27 12:48:28 +01:00
Robert Sachunsky
ba954d6314 training.models: fix daa084c3 2026-02-27 12:47:59 +01:00
Robert Sachunsky
7c3aeda65e training.models: fix 9b66867c 2026-02-27 12:40:56 +01:00
Robert Sachunsky
439ca350dd training: add metric ConfusionMatrix and plot it to TensorBoard 2026-02-26 13:55:37 +01:00
Robert Sachunsky
b6d2440ce1 training.utils.preprocess_imgs: fix polymorphy in 27f43c1
(Functions cannot be both generators and procedures,
 so make this a pure generator and save the image files
 on the caller's side; also avoids passing output
 directories)

Moreover, simplify by moving the `os.listdir` into the function
body (saving lots of extra variable bindings).
2026-02-25 20:39:15 +01:00
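
To illustrate the reasoning in b6d2440ce1 (this is not code from the repository): in Python, any function whose body contains `yield` is a generator function, so calling it never runs the body eagerly, and a "procedure" branch that writes files silently does nothing until the result is iterated. The sketch below uses hypothetical names (`preprocess`, `save_all`, `mixed`) and dummy byte payloads in place of images. It contrasts the mixed form with the split the commit describes: a pure generator, with the caller writing the files.

```python
import os
import tempfile


def preprocess(names):
    """Hypothetical pure generator: yields (stem, data) pairs, writes nothing."""
    for name in names:
        stem, _ = os.path.splitext(name)
        # Dummy payload standing in for a processed image.
        yield stem, name.upper().encode()


def save_all(names, out_dir):
    """Hypothetical caller: the only place that touches the output directory."""
    written = []
    for stem, data in preprocess(names):
        path = os.path.join(out_dir, stem + ".bin")
        with open(path, "wb") as handle:
            handle.write(data)
        written.append(path)
    return written


def mixed(names, out_dir=None):
    """Anti-pattern: the `yield` makes this a generator function, so the
    file-writing branch only runs once the returned generator is iterated."""
    for name in names:
        if out_dir is None:
            yield name
        else:
            open(os.path.join(out_dir, name), "w").close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        mixed(["a.txt"], out_dir)  # creates nothing: the body has not run yet
        print(os.listdir(out_dir))
        save_all(["a.png", "b.png"], out_dir)
        print(sorted(os.listdir(out_dir)))
```

With this split the generator needs no output directory argument, which matches the simplification the commit message mentions.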
Robert Sachunsky
42bab0f935 docs/train: document --missing-printspace=project 2026-02-25 13:18:40 +01:00
Robert Sachunsky
4202a1b2db training.generate-gt.pagexml2label: add --missing-printspace
- keep default (fallback to full page), but warn
- new option `skip`
- new option `project`
2026-02-25 11:16:21 +01:00
Robert Sachunsky
7823ea2c95 training.train: add early stopping for OCR 2026-02-25 00:16:07 +01:00
Robert Sachunsky
36e370aa45 training.train: add validation data for OCR 2026-02-25 00:10:43 +01:00
Robert Sachunsky
b399db3c00 training.models: simplify CTC loss layer 2026-02-24 20:43:50 +01:00
Robert Sachunsky
92fc2bd815 training.train: fix data batching for OCR in 27f43c17 2026-02-24 20:42:08 +01:00
Robert Sachunsky
86b009bc31 training.utils.preprocess_imgs: fix file name stemming in 27f43c17 2026-02-24 20:41:08 +01:00
Robert Sachunsky
20a3672be3 training.utils.preprocess_imgs: fix file shuffling in 27f43c17 2026-02-24 20:37:44 +01:00
Robert Sachunsky
658dade0d4 training.config_params: flip_index needed for scaling_flip, too 2026-02-24 20:36:00 +01:00
Robert Sachunsky
abf111de76 training: add metric for (same) number of connected components
(in trying to capture region instance separability)
2026-02-24 17:03:21 +01:00
Robert Sachunsky
18607e0f48 training: plot predictions to TB logs along with training/testing 2026-02-24 17:00:48 +01:00
Robert Sachunsky
56833b3f55 training: fix data representation in 7888fa5
(Eynollah models expect BGR/float instead of RGB/int)
2026-02-24 16:46:19 +01:00
Robert Sachunsky
a9496bbc70 enhancer/mbreorder: use std Keras data loader for classification 2026-02-17 18:39:30 +01:00
Robert Sachunsky
003c88f18a fix double import in 82266f82 2026-02-17 18:23:32 +01:00
Robert Sachunsky
f61effe8ce fix typo in c8240905 2026-02-17 18:20:58 +01:00
Robert Sachunsky
5f71333649 fix missing import in 49261fa9 2026-02-17 18:11:49 +01:00
Robert Sachunsky
67fca82f38 fix missing import in 27f43c17 2026-02-17 18:09:15 +01:00
Robert Sachunsky
6a4163ae56 fix typo in 27f43c17 2026-02-17 18:09:15 +01:00
Robert Sachunsky
c1b5cc92af fix typo in 7562317d 2026-02-17 18:09:15 +01:00
Robert Sachunsky
7bef8fa95a training.train: add verbose=1 consistently 2026-02-17 18:09:15 +01:00
Robert Sachunsky
9b66867c21 training.models: re-use transformer builder code 2026-02-17 18:09:15 +01:00
Robert Sachunsky
daa084c367 training.models: re-use UNet decoder builder code 2026-02-17 18:09:15 +01:00
Robert Sachunsky
fcd10c3956 training.models: re-use RESNET50 builder (+weight init) code 2026-02-17 18:09:15 +01:00
Robert Sachunsky
4414f7b89b training.models.vit_resnet50_unet: re-use IMAGE_ORDERING 2026-02-17 14:18:32 +01:00
Robert Sachunsky
7888fa5968 training: remove data_gen in favor of tf.data pipelines
instead of looping over file pairs indefinitely, yielding
Numpy arrays: re-use `keras.utils.image_dataset_from_directory`
here as well, but with img/label generators zipped together

(thus, everything will already be loaded/prefetched on the GPU)
2026-02-17 12:44:45 +01:00
Robert Sachunsky
83c2408192 training.utils.data_gen: avoid repeated array allocation 2026-02-17 12:44:45 +01:00
Robert Sachunsky
514a897dd5 training.train: assert n_epochs vs. index_start 2026-02-17 12:44:45 +01:00
Robert Sachunsky
37338049af training: use relative imports 2026-02-17 12:44:45 +01:00
Robert Sachunsky
7b7ef041ec training.models: use asymmetric zero padding instead of lambda layer 2026-02-17 12:44:45 +01:00
Robert Sachunsky
ee4bffd81d training.train: simplify transformer cfg checks 2026-02-17 12:44:45 +01:00
Robert Sachunsky
53252a59c6 training.models: fix glitch introduced in 3a73ccca 2026-02-17 12:44:45 +01:00
Robert Sachunsky
ea285124ce fix Patches/PatchEncoder (make configurable again) 2026-02-17 12:44:45 +01:00
Robert Sachunsky
2492c257c6 ocrd-tool.json: reinstate light_version and textline_light dummies for backwards compatibility 2026-02-07 16:52:54 +01:00
Robert Sachunsky
bd282a594d training follow-up:
- use relative imports
- use tf.keras everywhere (and ensure v2)
- `weights_ensembling`:
  * use `Patches` and `PatchEncoder` from .models
  * drop TF1 stuff
  * make function / CLI more flexible (expect list of
    checkpoint dirs instead of single top-level directory)
- train for `classification`: delegate to `weights_ensembling.run_ensembling`
2026-02-07 16:34:55 +01:00
Robert Sachunsky
27f43c175f Merge branch 'main' into ro-fixes and resolve conflicts
major conflicts resolved manually:

- branches for non-`light` segmentation already removed in main
- Keras/TF setup and no TF1 sessions, esp. in new ModelZoo
- changes to binarizer and its CLI (`mode`, `overwrite`, `run_single()`)
- writer: `build...` w/ kwargs instead of positional
- training for segmentation/binarization/enhancement tasks:
  * drop unused `generate_data_from_folder()`
  * simplify `preprocess_imgs()`: turn `preprocess_img()`, `get_patches()`
    and `get_patches_num_scale_new()` into generators, only writing
    result files in the caller (top-level loop) instead of passing
    output directories and file counter
- training for new OCR task:
  * `train`: put keys into additional `config_params` where they belong,
    resp. (conditioned under existing keys), and w/ better documentation
  * `train`: add new keys as kwargs to `run()` to make usable
  * `utils`: instead of custom data loader `data_gen_ocr()`, re-use
    existing `preprocess_imgs()` (for cfg capture and top-level loop),
    but extended w/ new kwargs and calling new `preprocess_img_ocr()`;
    the latter as single-image generator (also much simplified)
  * `train`: use tf.data loader pipeline from that generator w/ standard
    mechanisms for batching, shuffling, prefetching etc.
  * `utils` and `train`: instead of `vectorize_label`, use `Dataset.padded_batch`
  * add TensorBoard callback and re-use our checkpoint callback
  * also use standard Keras top-level loop for training

still problematic (substantially unresolved):
- `Patches` now only w/ fixed implicit size
  (ignoring training config params)
- `PatchEncoder` now only w/ fixed implicit num patches and projection dim
  (ignoring training config params)
2026-02-07 14:05:56 +01:00
Robert Sachunsky
6944d31617 modify manual RO preference
in `return_boxes_of_images_by_order_of_reading_new`,
when the next multicol separator ends in the same column,
do not recurse into subspan if the next starts earlier
(but continue with top span to the right first)
2026-02-05 17:58:32 +01:00
Robert Sachunsky
d047327a1f Merge pull request #5 from bertsky/ro-fixes-update-deps
update deps, refactor training
2026-02-05 17:36:50 +01:00
Robert Sachunsky
0d3a8eacba improve/update docs/train.md 2026-02-05 17:12:48 +01:00
Robert Sachunsky
b1633dfc7c training.generate_gt: for RO, skip files if regionRefs are missing 2026-02-05 17:12:48 +01:00
Robert Sachunsky
5d0c26b629 training.train: use std Keras data loader for classification
(much more efficient, works with std F1 metric)
2026-02-05 17:12:48 +01:00
Robert Sachunsky
f03124f747 training.train: simplify+fix classification data loaders
- unify `generate_data_from_folder_training` w/ `..._evaluation`
- instead of recreating array after every batch, just zero out
- cast image results to uint8 instead of uint16
- cast categorical results to float instead of int
2026-02-05 17:12:48 +01:00
Robert Sachunsky
82d649061a training.train: fix F1 metric score setup 2026-02-05 17:12:48 +01:00