Robert Sachunsky
6e57ab3741
textline_contours_postprocessing: do not catch arbitrary exceptions
2025-10-09 20:14:11 +02:00
Robert Sachunsky
fe603188f4
avoid unnecessary 3-channel conversions
2025-10-09 20:14:11 +02:00
Robert Sachunsky
155b8f68b8
matching deskewed text region contours with predicted: improve
...
- avoid duplicate and missing mappings by using a different approach:
instead of just minimising the center distance for the N contours
that we expect,
1. get all N:M distances
2. iterate over them from small to large
3. continue adding correspondences until both every original contour
and every deskewed contour have at least one match
4. where one original matches multiple deskewed contours,
join the latter polygons to map as single contour
5. where one deskewed contour matches multiple originals,
split the former by intersecting with each of the latter
(after bringing them into the same coordinate space),
so ultimately only the respective match gets assigned
2025-10-09 20:14:11 +02:00
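Steps 1 to 3 of commit 155b8f68b8 can be sketched roughly as follows. This is only an illustration under assumptions, not the project's code: the function name `match_centers` and its inputs (two plain arrays of contour centers) are hypothetical, and steps 4 and 5 (joining and splitting polygons) are left out.

```python
import numpy as np

def match_centers(orig_centers, desk_centers):
    """Pair up two sets of points until every point on both sides is matched.

    Hypothetical helper: orig_centers is (N, 2), desk_centers is (M, 2).
    Returns a list of (orig_index, deskewed_index) pairs.
    """
    orig = np.asarray(orig_centers, dtype=float).reshape(-1, 2)
    desk = np.asarray(desk_centers, dtype=float).reshape(-1, 2)
    # step 1: all N:M distances
    dist = np.linalg.norm(orig[:, None, :] - desk[None, :, :], axis=2)
    matched_orig = np.zeros(len(orig), dtype=bool)
    matched_desk = np.zeros(len(desk), dtype=bool)
    pairs = []
    # step 2: iterate over the distances from small to large
    for flat in np.argsort(dist, axis=None, kind="stable"):
        i, j = np.unravel_index(flat, dist.shape)
        # step 3: keep adding correspondences ...
        pairs.append((int(i), int(j)))
        matched_orig[i] = True
        matched_desk[j] = True
        # ... until both sides are fully covered
        if matched_orig.all() and matched_desk.all():
            break
    return pairs
```

With two original and three deskewed centers, one original necessarily ends up with two matches, which is the case step 4 would then resolve by joining polygons.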
Robert Sachunsky
0e00d7868b
matching deskewed text region contours with predicted: improve
...
- apply same min-area filter to deskewed contours as to original ones
2025-10-09 20:14:11 +02:00
Robert Sachunsky
0f33c21eb3
matching deskewed text region contours with predicted: improve
...
- when matching undeskewed and new contours, do not just
pick the one with the closest center, but also prefer one of
similar size (by making the contour area the 3rd dimension of
the vector norm in the distance calculation)
2025-10-09 20:14:11 +02:00
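The idea in commit 0f33c21eb3 of treating the contour area as a third coordinate can be illustrated like this. It is a minimal sketch, assuming centers and areas are available as arrays; the function name and the `area_weight` scaling parameter are hypothetical and not taken from the commit.

```python
import numpy as np

def center_area_distances(centers_a, areas_a, centers_b, areas_b, area_weight=1.0):
    """Pairwise distances over (x, y, area) feature vectors.

    area_weight is an assumed knob for balancing area against position.
    Returns an (N, M) array.
    """
    # append the (weighted) area as a third column next to x and y
    feats_a = np.column_stack([centers_a, area_weight * np.asarray(areas_a, dtype=float)])
    feats_b = np.column_stack([centers_b, area_weight * np.asarray(areas_b, dtype=float)])
    # Euclidean norm over all three dimensions for every pair
    return np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=2)
```

In this form a candidate with an identical center but a clearly different area can rank behind a slightly shifted candidate of the same size.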
Robert Sachunsky
73e5a1def8
matching deskewed text region contours with predicted: simplify
...
- (no need for argmax if already sorted)
2025-10-09 20:14:11 +02:00
Robert Sachunsky
d774a23daa
matching deskewed text region contours with predicted: simplify
...
- avoid loops in favour of array processing
- improve readability and identifiers
2025-10-09 20:14:11 +02:00
Robert Sachunsky
29b4527bde
do_order_of_regions: simplify
...
- remove duplicate code via inline def for the try-except
2025-10-09 20:14:11 +02:00
Robert Sachunsky
e674ea08f3
do_order_of_regions: drop redundant no/full_layout
...
(`_no_full_layout` is the same copied code as `_full_layout`;
the latter runs just the same if passed an empty list for headings)
2025-10-09 20:14:11 +02:00
Robert Sachunsky
e9bb62bd86
do_order_of_regions: simplify
...
- avoid loops in favour of array processing
2025-10-09 20:14:11 +02:00
Robert Sachunsky
7387f5a929
do_order_of_regions: improve box matching, simplify
...
- when searching for boxes matching contour, be more precise:
- avoid heuristic rules ("xmin + 80 within xrange") in favour
of exact criteria (contour properly contained in box)
- for fallback criterion (nearest centers), also require
proper containment of center in box
- `order_of_regions`: remove (now) unnecessary (and insufficient)
workaround for missing indexes (if boxes are not covering contours
exactly)
2025-10-09 20:14:11 +02:00
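The two criteria named in commit 7387f5a929 (proper containment of the contour, then nearest center with the center contained) might look like this in isolation. This is a sketch under assumptions: the function name `find_box`, the box layout `(xmin, xmax, ymin, ymax)` and the use of the vertex mean as contour center are all hypothetical choices for illustration.

```python
import numpy as np

def find_box(contour, boxes):
    """Return the index of the box matching a contour, or None.

    contour: (K, 2) array of x, y points; boxes: (B, 4) array of
    xmin, xmax, ymin, ymax (assumed layout).
    """
    contour = np.asarray(contour, dtype=float).reshape(-1, 2)
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    xmin, ymin = contour.min(axis=0)
    xmax, ymax = contour.max(axis=0)
    # exact criterion: the whole contour lies inside the box
    contained = ((boxes[:, 0] <= xmin) & (xmax <= boxes[:, 1]) &
                 (boxes[:, 2] <= ymin) & (ymax <= boxes[:, 3]))
    if contained.any():
        return int(np.flatnonzero(contained)[0])
    # fallback: nearest box center, but only among boxes containing the contour center
    cx, cy = contour.mean(axis=0)
    has_center = ((boxes[:, 0] <= cx) & (cx <= boxes[:, 1]) &
                  (boxes[:, 2] <= cy) & (cy <= boxes[:, 3]))
    if not has_center.any():
        return None
    box_centers = np.column_stack([boxes[:, :2].mean(axis=1), boxes[:, 2:].mean(axis=1)])
    dist = np.linalg.norm(box_centers - np.array([cx, cy]), axis=1)
    dist[~has_center] = np.inf
    return int(np.argmin(dist))
```

Returning None when no box qualifies makes the uncovered case explicit for the caller.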
Robert Sachunsky
4950e6bd78
order_of_regions: simplify
...
- use new `find_center_of_contours`
- avoid unused calculations
- avoid loops in favour of array processing
2025-10-09 20:14:10 +02:00
Robert Sachunsky
a1c8fd4467
do_order_of_regions / order_of_regions: simplify
...
- array-convert only once (before returning from `order_of_regions`)
- avoid passing `matrix_of_orders` unnecessarily between
`order_of_regions` and `order_and_id_of_texts`
2025-10-09 20:14:10 +02:00
Robert Sachunsky
415b2cbad8
eynollah, drop_capitals: simplify
...
- use new `find_center_of_contours`
2025-10-09 20:14:10 +02:00
Robert Sachunsky
3f3353ec3a
do_order_of_regions: simplify
...
- avoid loops in favour of array processing
2025-10-09 20:14:10 +02:00
Robert Sachunsky
8c3d5eb0eb
separate_marginals_to_left_and_right_and_order_from_top_to_down: simplify
...
- use new `find_center_of_contours`
- avoid loops in favour of array processing
- avoid repeated sorting
2025-10-09 20:14:10 +02:00
Robert Sachunsky
81827c2942
filter_contours_inside_a_bigger_one: simplify
...
- use new `find_center_of_contours`
- avoid loops in favour of array processing
- use sets instead of `np.unique`, and `np.delete` instead of `list.pop`
2025-10-06 13:32:34 +02:00
Robert Sachunsky
0b9d4901a6
contour features: avoid unused calculations, simplify, add shortcuts
...
- new function: `find_center_of_contours`
- simplified: `find_(new_)features_of_contours`
2025-10-02 20:51:03 +02:00
Robert Sachunsky
3aa7ad04fa
📝 update changelog
2025-09-30 23:14:52 +02:00
Robert Sachunsky
f0de1adabf
rm loky dependency
2025-09-30 23:12:18 +02:00
Robert Sachunsky
7daec392b9
Dockerfile: fix up CUDA installation for mixed TF/Torch
2025-09-30 22:10:45 +02:00
Robert Sachunsky
ad129ed46c
CI: remove OS from model cache keys
2025-09-30 22:05:53 +02:00
Robert Sachunsky
c86e59f481
CI: update model key, split up cache restore/save
2025-09-30 22:03:46 +02:00
Robert Sachunsky
a3d8197930
makefile: update model URL
2025-09-30 21:50:21 +02:00
Robert Sachunsky
61b20cc83d
tests: switch from subtests to parametrize, use --isolate everywhere to free CUDA memory in between
2025-09-30 19:20:35 +02:00
Robert Sachunsky
375e0263d4
CNN-RNN OCR model: switch to 20250930 version (compatible with TF 2.12 on CPU as well)
2025-09-30 19:16:50 +02:00
Robert Sachunsky
b21051db21
ProcessPoolExecutor: shutdown during del() instead of atexit()
2025-09-30 19:16:00 +02:00
Robert Sachunsky
08c8c26028
indent extremely long lines
2025-09-30 03:52:19 +02:00
Robert Sachunsky
f857ee7b51
simplify
2025-09-30 02:26:00 +02:00
Robert Sachunsky
c0137c29ad
try to fix the failed outsourcing of utils_ocr
2025-09-30 02:23:43 +02:00
Robert Sachunsky
13f85b0d5c
Merge branch 'main' into loky-with-shm-for-175-rebuilt
2025-09-30 02:07:20 +02:00
Robert Sachunsky
758602403e
replace loky with concurrent.futures.ProcessPoolExecutor (faster)
2025-09-29 17:48:22 +02:00
Robert Sachunsky
0366707136
get_smallest_skew: do not pass logger
2025-09-29 17:48:22 +02:00
Robert Sachunsky
b94c96fcbb
find_num_col: exit early if empty (avoiding exceptions)
2025-09-29 17:48:22 +02:00
Robert Sachunsky
04c3d7dd1b
get_smallest_skew: avoid shm if no ProcessPoolExecutor is passed
2025-09-29 17:48:22 +02:00
Robert Sachunsky
0662ece536
do_work_of_slopes*: use shm also in non-light mode(s)
2025-09-29 17:48:22 +02:00
Robert Sachunsky
31f240c3b8
do_image_rotation, do_work_of_slopes_new_curved: pass arrays via shared memory
2025-09-29 17:48:22 +02:00
Robert Sachunsky
8be2c79771
Revert "deskewing with faster multiprocessing"
...
This reverts commit 5db3e9fa64.
2025-09-29 17:48:22 +02:00
Robert Sachunsky
abf5c0f845
get_smallest_skew: when shifting search range of rotation angle, compare resulting (maximum) variances instead of blindly assuming the new range is better
2025-09-29 17:48:22 +02:00
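The comparison described in commit abf5c0f845 can be pictured as follows. This is a hedged sketch only: `best_angle` and `pick_range` are hypothetical names, and each search range is assumed to be given as a sequence of candidate angles with one variance score per angle.

```python
import numpy as np

def best_angle(angles, variances):
    """Return the (angle, variance) pair with the maximum variance in one range."""
    k = int(np.argmax(variances))
    return angles[k], variances[k]

def pick_range(first, second):
    """Compare two search ranges by their maximum variance.

    first and second are (angles, variances) tuples; the shifted range
    (second) only wins if its best variance is actually higher.
    """
    angle1, var1 = best_angle(*first)
    angle2, var2 = best_angle(*second)
    return (angle2, var2) if var2 > var1 else (angle1, var1)
```

The point of the comparison is that a shifted range is accepted on evidence, not by default.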
Robert Sachunsky
dc0caad512
writer: use @type='heading' instead of 'header'
2025-09-29 17:48:22 +02:00
Robert Sachunsky
f458e3ece0
writer: SeparatorRegion needs SeparatorRegionType (not ImageRegionType)
2025-09-29 17:48:22 +02:00
Robert Sachunsky
4337d62985
contours: rename 'pixel' → 'label' for clarity
2025-09-29 17:48:22 +02:00
Robert Sachunsky
5b16c2fc00
avoid pulling unused 'image_page_rotated' through functions
2025-09-29 17:48:22 +02:00
Robert Sachunsky
5bff2d156a
use box2rect instead of crop_image_inside_box when no image needed
2025-09-29 17:48:22 +02:00
Robert Sachunsky
9b5182c1c0
utils: introduce box2rect and box2slice
2025-09-29 17:48:19 +02:00
Robert Sachunsky
bca2ae3d78
get_marginals: exit early if no peaks found to avoid spurious overlap mask
2025-09-29 17:47:51 +02:00
Robert Sachunsky
235539a350
filter_contours_without_textline_inside: avoid removing from identical lists twice
2025-09-29 17:47:51 +02:00
Robert Sachunsky
11e143afee
polygon2contour: avoid overflow
2025-09-29 17:47:51 +02:00
Robert Sachunsky
7a9e8256ee
increase dilation: textregions/lines (5→6), seplines (0→1)
2025-09-29 17:47:51 +02:00
Robert Sachunsky
f3faa29528
refactor shapely conversions into contour2polygon / polygon2contour, also handle heterogeneous geometries
2025-09-29 17:47:51 +02:00