Commit graph

1136 commits

Author SHA1 Message Date
Robert Sachunsky
6e57ab3741 textline_contours_postprocessing: do not catch arbitrary exceptions 2025-10-09 20:14:11 +02:00
Robert Sachunsky
fe603188f4 avoid unnecessary 3-channel conversions 2025-10-09 20:14:11 +02:00
Robert Sachunsky
155b8f68b8 matching deskewed text region contours with predicted: improve
- avoid duplicate and missing mappings by using a different approach:
  instead of just minimising the center distance for the N contours
  that we expect,
  1. get all N:M distances
  2. iterate over them from small to large
  3. continue adding correspondences until both every original contour
     and every deskewed contour have at least one match
  4. where one original matches multiple deskewed contours,
     join the latter polygons to map as single contour
  5. where one deskewed contour matches multiple originals,
     split the former by intersecting with each of the latter
     (after bringing them into the same coordinate space),
     so ultimately only the respective match gets assigned
2025-10-09 20:14:11 +02:00
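Steps 1 to 3 of the commit above can be sketched roughly as follows. This is a minimal illustration and not the code from the commit: the function name `match_contours` and its arguments are hypothetical, the contours are reduced to their centers, and skipping a pair whose two contours are both already matched is an assumption that the commit message does not state. Joining and splitting polygons (steps 4 and 5) is left out.

```python
import numpy as np

def match_contours(centers_orig, centers_new):
    """Greedy N:M matching of contour centers (hypothetical helper).

    Returns a list of (index_orig, index_new) pairs such that every
    original and every new contour occurs in at least one pair.
    """
    centers_orig = np.asarray(centers_orig, dtype=float)
    centers_new = np.asarray(centers_new, dtype=float)
    n, m = len(centers_orig), len(centers_new)
    # 1. all N:M pairwise distances between centers
    dists = np.linalg.norm(
        centers_orig[:, None, :] - centers_new[None, :, :], axis=2)
    # 2. visit candidate pairs from smallest to largest distance
    flat = np.argsort(dists, axis=None, kind="stable")
    rows, cols = np.unravel_index(flat, dists.shape)
    matched_orig, matched_new, pairs = set(), set(), []
    for i, j in zip(rows.tolist(), cols.tolist()):
        # 3. stop as soon as both sides are fully covered
        if len(matched_orig) == n and len(matched_new) == m:
            break
        # assumption of this sketch: a pair adds nothing if both
        # of its contours already have a match
        if i in matched_orig and j in matched_new:
            continue
        pairs.append((i, j))
        matched_orig.add(i)
        matched_new.add(j)
    return pairs

pairs = match_contours([(0, 0), (10, 0)], [(0, 1), (10, 1), (11, 1)])
print(pairs)
```

In this example the second original contour ends up paired with two new contours, which is the case where step 4 would join the latter into a single polygon.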
Robert Sachunsky
0e00d7868b matching deskewed text region contours with predicted: improve
- apply same min-area filter to deskewed contours as to original ones
2025-10-09 20:14:11 +02:00
Robert Sachunsky
0f33c21eb3 matching deskewed text region contours with predicted: improve
- when matching undeskewed and new contours, do not just
  pick the closest centers, but also prefer contours of similar
  size (by making the contour area the 3rd dimension of the
  vector norm in the distance calculation)
2025-10-09 20:14:11 +02:00
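The idea of treating the area as a third coordinate might look like the sketch below. It is only an assumption-laden illustration: the helper name `nearest_by_center_and_area` is hypothetical, and the commit message does not say whether or how the area is scaled relative to the coordinates, so the raw area is used here.

```python
import numpy as np

def nearest_by_center_and_area(centers_a, areas_a, centers_b, areas_b):
    """For each contour in A, return the index of the closest contour in B,
    measured over (x, y, area) instead of (x, y) alone (hypothetical helper)."""
    # stack center coordinates and area into 3-dimensional feature vectors
    feats_a = np.column_stack([centers_a, areas_a]).astype(float)
    feats_b = np.column_stack([centers_b, areas_b]).astype(float)
    # pairwise Euclidean norm over all three dimensions
    dists = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=2)
    return dists.argmin(axis=1)

# The first candidate is closer by center, but the second has the same area.
idx = nearest_by_center_and_area([(0, 0)], [100], [(1, 0), (3, 0)], [10, 100])
print(idx.tolist())
```

With centers alone the first candidate would win; with the area included, the similarly sized second candidate is preferred.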
Robert Sachunsky
73e5a1def8 matching deskewed text region contours with predicted: simplify
- (no need for argmax if already sorted)
2025-10-09 20:14:11 +02:00
Robert Sachunsky
d774a23daa matching deskewed text region contours with predicted: simplify
- avoid loops in favour of array processing
- improve readability and identifiers
2025-10-09 20:14:11 +02:00
Robert Sachunsky
29b4527bde do_order_of_regions: simplify
- remove duplicate code via inline def for the try-catch
2025-10-09 20:14:11 +02:00
Robert Sachunsky
e674ea08f3 do_order_of_regions: drop redundant no/full_layout
(`_no_full_layout` is the same copied code as `_full_layout`;
 the latter runs just the same if passed an empty list for headings)
2025-10-09 20:14:11 +02:00
Robert Sachunsky
e9bb62bd86 do_order_of_regions: simplify
- avoid loops in favour of array processing
2025-10-09 20:14:11 +02:00
Robert Sachunsky
7387f5a929 do_order_of_regions: improve box matching, simplify
- when searching for boxes matching contour, be more precise:
  - avoid heuristic rules ("xmin + 80 within xrange") in favour
    of exact criteria (contour properly contained in box)
  - for fallback criterion (nearest centers), also require
    proper containment of center in box
- `order_of_regions`: remove (now) unnecessary (and insufficient)
  workaround for missing indexes (if boxes are not covering contours
  exactly)
2025-10-09 20:14:11 +02:00
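The two criteria named in the commit above (containment of the contour in a box, then nearest centers with the center inside the box) can be illustrated as below. This is a sketch under assumptions, not the project's implementation: the function name `find_box`, the `(xmin, xmax, ymin, ymax)` box layout, the inclusive bounds, and the use of the mean of the contour points as its center are all hypothetical choices.

```python
import numpy as np

def find_box(contour, boxes):
    """Return the index of the box matching a contour, or None (hypothetical helper).

    contour: points as (N, 2) or OpenCV-style (N, 1, 2)
    boxes:   rows of (xmin, xmax, ymin, ymax) -- assumed layout
    """
    pts = np.asarray(contour, dtype=float).reshape(-1, 2)
    xmin, xmax, ymin, ymax = np.asarray(boxes, dtype=float).T
    x, y = pts[:, 0], pts[:, 1]
    # exact criterion: the whole contour lies within the box
    contained = ((x.min() >= xmin) & (x.max() <= xmax) &
                 (y.min() >= ymin) & (y.max() <= ymax))
    if contained.any():
        return int(np.flatnonzero(contained)[0])
    # fallback: nearest box center, but only among boxes holding the
    # contour center (here simply the mean of the points, an assumption)
    cx, cy = pts.mean(axis=0)
    holds = (cx >= xmin) & (cx <= xmax) & (cy >= ymin) & (cy <= ymax)
    if not holds.any():
        return None
    dist = np.hypot((xmin + xmax) / 2 - cx, (ymin + ymax) / 2 - cy)
    return int(np.where(holds, dist, np.inf).argmin())

boxes = [(0, 100, 0, 50), (0, 100, 50, 120)]
print(find_box([(10, 60), (40, 60), (40, 100), (10, 100)], boxes))
print(find_box([(10, 40), (40, 40), (40, 70), (10, 70)], boxes))
```

The first contour is fully contained in the second box; the second one straddles both boxes and is assigned through the fallback.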
Robert Sachunsky
4950e6bd78 order_of_regions: simplify
- use new `find_center_of_contours`
- avoid unused calculations
- avoid loops in favour of array processing
2025-10-09 20:14:10 +02:00
Robert Sachunsky
a1c8fd4467 do_order_of_regions / order_of_regions: simplify
- array-convert only once (before returning from `order_of_regions`)
- avoid passing `matrix_of_orders` unnecessarily between
  `order_of_regions` and `order_and_id_of_texts`
2025-10-09 20:14:10 +02:00
Robert Sachunsky
415b2cbad8 eynollah, drop_capitals: simplify
- use new `find_center_of_contours`
2025-10-09 20:14:10 +02:00
Robert Sachunsky
3f3353ec3a do_order_of_regions: simplify
- avoid loops in favour of array processing
2025-10-09 20:14:10 +02:00
Robert Sachunsky
8c3d5eb0eb separate_marginals_to_left_and_right_and_order_from_top_to_down: simplify
- use new `find_center_of_contours`
- avoid loops in favour of array processing
- avoid repeated sorting
2025-10-09 20:14:10 +02:00
kba
8215814a3f Merge branch 'changelog-v0.5.0' 2025-10-09 14:03:45 +02:00
kba
4ffe6190d2 📝 changelog 2025-10-09 14:03:26 +02:00
vahidrezanezhad
8869c20c33 updating CHANGELOG for v0.5.0 2025-10-09 13:54:29 +02:00
Robert Sachunsky
81827c2942 filter_contours_inside_a_bigger_one: simplify
- use new `find_center_of_contours`
- avoid loops in favour of array processing
- use sets instead of `np.unique`, and `np.delete` instead of `list.pop`
2025-10-06 13:32:34 +02:00
Robert Sachunsky
0b9d4901a6 contour features: avoid unused calculations, simplify, add shortcuts
- new function: `find_center_of_contours`
- simplified: `find_(new_)features_of_contours`
2025-10-02 20:51:03 +02:00
kba
8a9b4f8f55 remove commented-out requirement for tf == 2.12.1, rely on same version as in eynollah proper 2025-10-02 12:16:26 +02:00
kba
f60e0543ab training: update docs 2025-10-01 19:16:58 +02:00
kba
1c043c586a eynollah-training: all training CLI into single click group 2025-10-01 19:16:45 +02:00
kba
690d47444c make relative wildcard imports explicit 2025-10-01 18:43:20 +02:00
kba
2baf42e878 organize imports, use relative imports 2025-10-01 18:15:54 +02:00
kba
4f5cdf3140 move training scripts to src/eynollah/training 2025-10-01 18:12:45 +02:00
kba
f0ef2b5db2 remove unused imports 2025-10-01 18:10:13 +02:00
kba
95bb5908bb Merge branch 'integrate-training-from-sbb_pixelwise_segmentation' of https://github.com/qurator-spk/eynollah into integrate-training-from-sbb_pixelwise_segmentation 2025-10-01 18:02:09 +02:00
kba
48266b1ee0 make training dependencies optional-dependencies of eynollah
i.e. `pip install "eynollah[training]"` will install the requirements for training
2025-10-01 18:01:25 +02:00
kba
733af1e9a7 📝 update train/README.md, align with docs/train.md 2025-10-01 17:43:32 +02:00
vahidrezanezhad
5725e4fd1f - Continue processing when num_col is None but textregions exist.
- Convert marginal-only to main body if no main body is present.
- Reset deskew angle to 0 when text region density (textregion area to page area) < 0.3 and angle > 45°.
2025-10-01 15:58:03 +02:00
cneud
4514d417a7 force GH markdown code block in list 2025-10-01 01:16:25 +02:00
cneud
e027bc038e Update README.md 2025-10-01 01:05:15 +02:00
cneud
91d2a74ac9 remove redundant parentheses 2025-10-01 00:38:01 +02:00
cneud
f2f93e0251 list literal is faster than using list constructor to create a new list 2025-10-01 00:26:27 +02:00
cneud
70af00182b mutable defaults are the source of all evil 2025-10-01 00:20:18 +02:00
cneud
1d0616eb69 comparisons to None should not use the equality operators 2025-10-01 00:15:11 +02:00
cneud
9ce127eb51 remove unnecessary backslash 2025-10-01 00:04:53 +02:00
cneud
558867eb24 fix typo 2025-10-01 00:04:07 +02:00
Robert Sachunsky
3aa7ad04fa 📝 update changelog 2025-09-30 23:14:52 +02:00
Robert Sachunsky
f0de1adabf rm loky dependency 2025-09-30 23:12:18 +02:00
Robert Sachunsky
7daec392b9 Dockerfile: fix up CUDA installation for mixed TF/Torch 2025-09-30 22:10:45 +02:00
Robert Sachunsky
ad129ed46c CI: remove OS from model cache keys 2025-09-30 22:05:53 +02:00
Robert Sachunsky
c86e59f481 CI: update model key, split up cache restore/save 2025-09-30 22:03:46 +02:00
Robert Sachunsky
a3d8197930 makefile: update model URL 2025-09-30 21:50:21 +02:00
Robert Sachunsky
61b20cc83d tests: switch from subtests to parametrize, use --isolate everywhere to free CUDA memory in between 2025-09-30 19:20:35 +02:00
Robert Sachunsky
375e0263d4 CNN-RNN OCR model: switch to 20250930 version (compatible with TF 2.12 on CPU as well) 2025-09-30 19:16:50 +02:00
Robert Sachunsky
b21051db21 ProcessPoolExecutor: shutdown during del() instead of atexit() 2025-09-30 19:16:00 +02:00
Robert Sachunsky
08c8c26028 indent extremely long lines 2025-09-30 03:52:19 +02:00