major conflicts resolved manually:
- branches for non-`light` segmentation already removed in main
- Keras/TF setup and no TF1 sessions, esp. in new ModelZoo
- changes to binarizer and its CLI (`mode`, `overwrite`, `run_single()`)
- writer: `build...` w/ kwargs instead of positional
- training for segmentation/binarization/enhancement tasks:
* drop unused `generate_data_from_folder()`
* simplify `preprocess_imgs()`: turn `preprocess_img()`, `get_patches()`
and `get_patches_num_scale_new()` into generators, only writing
result files in the caller (top-level loop) instead of passing
output directories and file counter
- training for new OCR task:
* `train`: put new keys into additional `config_params` where each belongs
(conditioned under existing keys), and w/ better documentation
* `train`: add new keys as kwargs to `run()` to make usable
* `utils`: instead of custom data loader `data_gen_ocr()`, re-use
existing `preprocess_imgs()` (for cfg capture and top-level loop),
but extended w/ new kwargs and calling new `preprocess_img_ocr()`;
the latter as single-image generator (also much simplified)
* `train`: use tf.data loader pipeline from that generator w/ standard
mechanisms for batching, shuffling, prefetching etc.
* `utils` and `train`: instead of `vectorize_label`, use `Dataset.padded_batch`
* add TensorBoard callback and re-use our checkpoint callback
* also use standard Keras top-level loop for training
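
The generator refactoring described above (patch extraction yields results, only the top-level loop writes files) can be sketched roughly as follows. This is not the actual code: `iter_patches()`, `write_patches()`, the nested-list "image" and the text-file output are hypothetical stand-ins chosen to keep the example dependency-free.

```python
import os
import tempfile
from typing import Iterable, Iterator, List, Tuple

Image = List[List[int]]  # hypothetical stand-in for an image array


def iter_patches(img: Image, label: Image,
                 height: int, width: int) -> Iterator[Tuple[Image, Image]]:
    """Yield (image patch, label patch) pairs; no file I/O, no counter."""
    rows, cols = len(img), len(img[0])
    for y in range(0, rows - height + 1, height):
        for x in range(0, cols - width + 1, width):
            yield ([row[x:x + width] for row in img[y:y + height]],
                   [row[x:x + width] for row in label[y:y + height]])


def write_patches(pairs: Iterable[Tuple[Image, Image]],
                  dir_img: str, dir_lab: str) -> int:
    """Top-level loop: the only place that knows output dirs and the counter."""
    count = 0
    for count, (patch, lab) in enumerate(pairs, start=1):
        for directory, data in ((dir_img, patch), (dir_lab, lab)):
            # Placeholder serialization; real code would write image files.
            with open(os.path.join(directory, f"img_{count}.txt"), "w") as f:
                f.write(repr(data))
    return count


# Usage: a 4x4 toy image cut into non-overlapping 2x2 patches.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
lab = [[0] * 4 for _ in range(4)]
with tempfile.TemporaryDirectory() as d_img, tempfile.TemporaryDirectory() as d_lab:
    n = write_patches(iter_patches(img, lab, 2, 2), d_img, d_lab)
```

A generator of this shape can also be wrapped with `tf.data.Dataset.from_generator()`, which is what makes the standard batching, shuffling and prefetching mechanisms available for the OCR task.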
still problematic (substantially unresolved):
- `Patches` now only w/ fixed implicit size
(ignoring training config params)
- `PatchEncoder` now only w/ fixed implicit num patches and projection dim
(ignoring training config params)
- do not restrict TF version, but depend on tf-keras and
set `TF_USE_LEGACY_KERAS=1` to avoid Keras 3 behaviour
- relax Numpy version requirement up to v2
- relax Torch version requirement
- drop TF1 session management code
- drop TF1 config in favour of TF2 config code for memory growth
- training.*: also simplify and limit line length
- training.train: always train with TensorBoard callback
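
For illustration, the legacy-Keras switch and the TF2-style memory growth setup could look roughly like this. It is a sketch only: the helper name `enable_memory_growth()` is hypothetical, and the `ImportError` guard exists just so the snippet also runs where TensorFlow is not installed.

```python
import os

# Must be set before TensorFlow is imported, so that tf.keras resolves
# to the tf-keras package (Keras 2 behaviour) rather than Keras 3.
os.environ.setdefault("TF_USE_LEGACY_KERAS", "1")


def enable_memory_growth() -> int:
    """Enable on-demand GPU memory allocation; return number of GPUs configured."""
    try:
        import tensorflow as tf
    except ImportError:
        return 0  # assumption for this sketch: silently skip without TF
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # TF2 replacement for the TF1 session option `allow_growth`.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


num_gpus = enable_memory_growth()
```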
OK, so now numpy is the culprit (shipped without a version bound via ocrd): several deprecations expired with the release of v1.24.0, which requires changes to our codebase, e.g.
* The deprecation for the aliases np.object, np.bool, np.float, np.complex, np.str, and np.int is expired
* Ragged array creation will now always raise a ValueError unless dtype=object is passed.
See also here: https://numpy.org/devdocs/release/1.24.0-notes.html#expired-deprecations
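
The kind of change this requires can be sketched as follows (variable names are made up for illustration and do not refer to actual code in this repository):

```python
import numpy as np

# Builtin types replace the removed aliases, e.g. np.bool -> bool,
# np.int -> int, np.float -> float, np.object -> object.
mask = np.zeros((2, 3), dtype=bool)
counts = np.zeros(3, dtype=int)
scores = np.ones(3, dtype=float)

# Ragged (non-rectangular) input must now request dtype=object explicitly;
# the result is a 1-D array whose elements are the original lists.
contours = [[1, 2, 3], [4, 5]]
ragged = np.array(contours, dtype=object)
```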
Cap tensorflow version to <2.12.0 until we have time to adapt to the API changes, e.g.
* Support for Python 3.11 has been added.
* Support for Python 3.7 has been removed.
See also https://github.com/tensorflow/tensorflow/releases/tag/v2.12.0.
eynollah requires at least ocrd >= 2.22.0 for the resource resolving code,
otherwise it fails with an AttributeError. Fix this by bumping up the
requirement.
I bumped it to 2.23.3 so core *also* includes the latest model resource
for eynollah.