Commit graph

1136 commits

Author SHA1 Message Date
kba
50e8b2c266 Merge branch 'integrate-training-from-sbb_pixelwise_segmentation' 2025-10-17 10:33:04 +02:00
kba
46d25647f7 📝 changelog 2025-10-17 10:32:15 +02:00
Robert Sachunsky
2ac01ecacc join_polygons: try to catch rare case of MultiPolygon 2025-10-17 10:31:51 +02:00
kba
2e0fb64dcb disable ruff check for training code for now 2025-10-16 21:29:37 +02:00
kba
76c13bcfd7 Merge branch 'integrate-training-from-sbb_pixelwise_segmentation' of https://github.com/qurator-spk/eynollah into integrate-training-from-sbb_pixelwise_segmentation 2025-10-16 20:50:24 +02:00
kba
af5abb77fd Merge branch 'main' into integrate-training-from-sbb_pixelwise_segmentation 2025-10-16 20:50:16 +02:00
kba
d2f0a43088 📝 changelog 2025-10-16 20:46:49 +02:00
Konstantin Baierer
3bd3faef68
Merge pull request #193 from qurator-spk/training-installation
Training installation
2025-10-16 20:39:17 +02:00
kba
1e66c85222 Merge branch 'integrate-training-from-sbb_pixelwise_segmentation' into training-installation 2025-10-16 16:18:02 +02:00
kba
bd8c8bfeac training: pin numpy to <1.24 as well 2025-10-16 16:15:31 +02:00
Robert Sachunsky
948c8c3441 join_polygons: try to catch rare case of MultiPolygon 2025-10-15 16:58:17 +02:00
kba
f485dd4181 📦 v0.6.0rc2 2025-10-14 16:10:50 +02:00
kba
c1f0158806 📝 changelog 2025-10-14 14:53:15 +02:00
kba
7daa0a1bd5 Merge branch 'fix-196' into prepare-v0.6.0rc2 2025-10-14 14:52:36 +02:00
kba
2febf53479 📝 changelog 2025-10-14 14:52:31 +02:00
Robert Sachunsky
8299e7009a setup_models: avoid unnecessarily loading region_fl 2025-10-14 14:27:32 +02:00
Robert Sachunsky
e8b7212f36 polygon2contour: avoid uint for coords
(introduced in a433c736 to make consistent with
 `filter_contours_area_of_image`, but actually
 np.uint is prone to create overflows downstream)
2025-10-14 14:27:26 +02:00
kba
745cf3be48 XML encoding should be utf-8 not utf8
... and should use OCR-D's generateDS PAGE API consistently
2025-10-10 16:39:17 +02:00
kba
2056a8bdb9 📦 v0.6.0rc1 2025-10-10 16:32:47 +02:00
Robert Sachunsky
4e9a1618c3 layout: refactor model setup, allow loading custom versions
- simplify definition of (defaults for) model versions
- unify loading of loadable models (depending on mode)
- use `self.models` dict instead of `self.model_*` attributes
- add `model_versions` kwarg / `--model_version` CLI option
2025-10-10 03:18:09 +02:00
Robert Sachunsky
374818de11 📝 update changelog for 5725e4f 2025-10-09 23:11:05 +02:00
Robert Sachunsky
c4cb16c2a8 simplify
(`skip_layout_and_reading_order` is already an attr)
2025-10-09 23:05:50 +02:00
Robert Sachunsky
ecb53056f2 Merge branch 'main' of https://github.com/qurator-spk/eynollah into loky-with-shm-for-175-rebuilt 2025-10-09 22:54:11 +02:00
Robert Sachunsky
d96af425a7
Merge pull request #4 from bertsky/loky-with-shm-for-175-rebuilt-refactored
refactoring for 192: speedup and improvements
2025-10-09 22:18:53 +02:00
Robert Sachunsky
cab392601e 📝 update changelog 2025-10-09 20:14:11 +02:00
Robert Sachunsky
e1b56d97da CI: lint with ruff 2025-10-09 20:14:11 +02:00
Robert Sachunsky
a144026b27 add rough ruff config 2025-10-09 20:14:11 +02:00
Robert Sachunsky
b3d29bef89 return_contours_of_interested_region*: rm unused variants 2025-10-09 20:14:11 +02:00
Robert Sachunsky
8a2d682e12 fix identifier scope in layout OCR options (w/o full_layout) 2025-10-09 20:14:11 +02:00
Robert Sachunsky
096def1e9d mbreorder/enhancement: fix missing imports
(not sure if these models really need that, though)
2025-10-09 20:14:11 +02:00
Robert Sachunsky
027b87d321 fixup c0137c2 (missing arguments for utils_ocr) 2025-10-09 20:14:11 +02:00
Robert Sachunsky
1d4815b48f utils_ocr: forgot to pass coordinate offsets 2025-10-09 20:14:11 +02:00
Robert Sachunsky
839b7c4d84 make models: avoid re-download 2025-10-09 20:14:11 +02:00
Robert Sachunsky
e5b5264568 CI: add diagnostic message for model symlink 2025-10-09 20:14:11 +02:00
Robert Sachunsky
ca72a095ca tests: cover table detection in various modes 2025-10-09 20:14:11 +02:00
Robert Sachunsky
5e11a68a3e writer/run_single: consistent kwarg naming conf_contours_textregion(s) 2025-10-09 20:14:11 +02:00
Robert Sachunsky
75823f9bed run_single: call writer.build_pagexml_no_full_layout w/ kwargs 2025-10-09 20:14:11 +02:00
Robert Sachunsky
cbbb3248c7 writer: simplify
- `build_pagexml_no_full_layout`: delegate to
  `build_pagexml_full_layout` (removing redundant code)
2025-10-09 20:14:11 +02:00
Robert Sachunsky
e32479765c writer: simplify
- simplify serialization of coordinates
- re-use `serialize_lines_in_region` (drop `*_in_dropcapital` and `*_in_marginal`)
- re-use `calculate_polygon_coords`
2025-10-09 20:14:11 +02:00
Robert Sachunsky
d88ca18eec get/do_work_of_slopes etc.: reduce call/return signatures
- `get_textregion_contours_in_org_image_light`: no more need
  to also return unchanged contours here (see 41cc38c5); therefore
- `txt_con_org`: no more need for this
  (now mere alias to `contours_only_text_parent`); also
- `index_by_text_par_con`: no more need for this (see prev. commit),
  so do not pass/return
- `get_slopes_and_deskew_*`: do not pass `contours_only_text`
  (where not used)
- `get_slopes_and_deskew_*`: do not return unchanged contours, boxes
- `do_work_of_slopes_*`: adapt respectively
2025-10-09 20:14:11 +02:00
Robert Sachunsky
02a347a48a no more need to rm from contours_only_text_parent_d_ordered now 2025-10-09 20:14:11 +02:00
Robert Sachunsky
fd43e78442 filter_contours_without_textline_inside: simplify
- np.delete in index array instead of contour lists
- yield actual resulting indices
2025-10-09 20:14:11 +02:00
Robert Sachunsky
0a80cd5dff avoid unnecessary 3-channel conversions: for tables, too 2025-10-09 20:14:11 +02:00
Robert Sachunsky
dfdc705375 do_work_of_slopes: rm unused old variant 2025-10-09 20:14:11 +02:00
Robert Sachunsky
2e907875c1 get_text_region_boxes_by_given_contours: simplify 2025-10-09 20:14:11 +02:00
Robert Sachunsky
d53f829dfd filter_contours_inside_a_bigger_one: fix edge case in 81827c29 2025-10-09 20:14:11 +02:00
Robert Sachunsky
18bbdb7c48 CI: run deps-test with OCR extra so symlink rule fires 2025-10-09 20:14:11 +02:00
Robert Sachunsky
23535998f7 tests: symlink OCR models into layout model directory
(so layout with OCR options works with our split model packages)
2025-10-09 20:14:11 +02:00
Robert Sachunsky
a1904fa660 tests: cover layout with OCR in various modes 2025-10-09 20:14:11 +02:00
Robert Sachunsky
595ed02743 run_single: simplify; allow running TrOCR in non-fl mode, too
- refactor final `self.full_layout` conditional, removing copied code
- allow running `self.ocr` and `self.tr` branch in both cases (non/fl)
- when running TrOCR, use model / processor / device initialised during init
  (instead of ad-hoc loading)
2025-10-09 20:14:11 +02:00