Robert Sachunsky
ffe7a2de6b
make models: avoid re-download
2025-10-08 12:33:14 +02:00
Robert Sachunsky
ee91caee4a
fixup 70344c13
2025-10-08 12:23:10 +02:00
Robert Sachunsky
7ed2da966f
CI: add diagnostic message for model symlink
2025-10-08 12:17:53 +02:00
Robert Sachunsky
05fb64676a
fixup a388de1
2025-10-08 12:13:12 +02:00
Robert Sachunsky
e4ce4c593b
fixup for e451ccd0
...
(`contours_only_text_parent_d_ordered` is not None
any more, but always a list)
2025-10-08 02:08:24 +02:00
Robert Sachunsky
26266dd13b
writer/run_single: consistent kwarg naming conf_contours_textregion(s)
2025-10-08 01:43:26 +02:00
Robert Sachunsky
7dd51d1b10
run_single: call writer.build_pagexml_no_full_layout w/ kwargs
2025-10-08 01:42:26 +02:00
Robert Sachunsky
cc4f263d88
tests: cover table detection in various modes
2025-10-08 01:42:26 +02:00
Robert Sachunsky
4ec7999803
writer: simplify
...
- `build_pagexml_no_full_layout`: delegate to
`build_pagexml_full_layout` (removing redundant code)
2025-10-08 01:42:18 +02:00
Robert Sachunsky
0d3d476f0a
writer: simplify
...
- simplify serialization of coordinates
- re-use `serialize_lines_in_region` (drop `*_in_dropcapital` and `*_in_marginal`)
- re-use `calculate_polygon_coords`
2025-10-07 23:03:27 +02:00
Robert Sachunsky
a388de147c
get/do_work_of_slopes etc.: reduce call/return signatures
...
- `get_textregion_contours_in_org_image_light`: no more need
to also return unchanged contours here (see 41cc38c5); therefore
- `txt_con_org`: no more need for this
(now mere alias to `contours_only_text_parent`); also
- `index_by_text_par_con`: no more need for this (see prev. commit),
so do not pass/return
- `get_slopes_and_deskew_*`: do not pass `contours_only_text`
(where not used)
- `get_slopes_and_deskew_*`: do not return unchanged contours, boxes
- `do_work_of_slopes_*`: adapt respectively
2025-10-07 22:53:30 +02:00
Robert Sachunsky
e451ccd0a6
no more need to rm from contours_only_text_parent_d_ordered now
2025-10-07 22:47:34 +02:00
Robert Sachunsky
c770108941
filter_contours_without_textline_inside: simplify
...
- np.delete in index array instead of contour lists
- yield actual resulting indices
2025-10-07 22:42:36 +02:00
Robert Sachunsky
a39a9c5cc4
avoid unnecessary 3-channel conversions: for tables, too
2025-10-07 22:37:05 +02:00
Robert Sachunsky
634d2b059f
do_work_of_slopes: rm unused old variant
2025-10-07 22:33:06 +02:00
Robert Sachunsky
3e7628b5cd
get_text_region_boxes_b_given_contours: simplify
2025-10-07 22:32:06 +02:00
Robert Sachunsky
316d813db9
filter_contours_inside_a_bigger_one: fix edge case in 81827c29
2025-10-07 22:06:57 +02:00
Robert Sachunsky
f700aaf371
CI: run deps-test with OCR extra so symlink rule fires
2025-10-07 00:54:25 +02:00
Robert Sachunsky
59a19a169d
tests: symlink OCR models into layout model directory
...
(so layout with OCR options works with our split model packages)
2025-10-06 21:27:21 +02:00
Robert Sachunsky
cd8e6b81eb
tests: cover layout with OCR in various modes
2025-10-06 17:44:12 +02:00
Robert Sachunsky
4bb93b8f46
run_single: simplify; allow running TrOCR in non-fl mode, too
...
- refactor final `self.full_layout` conditional, removing copied code
- allow running `self.ocr` and `self.tr` branch in both cases (non/fl)
- when running TrOCR, use model / processor / device initialised during init
(instead of ad-hoc loading)
2025-10-06 17:24:50 +02:00
Robert Sachunsky
4a18a486a0
textline_contours_postprocessing: do not catch arbitrary exceptions
2025-10-06 16:53:59 +02:00
Robert Sachunsky
70344c137c
avoid unnecessary 3-channel conversions: missing cases
2025-10-06 16:53:06 +02:00
Robert Sachunsky
51995c9e46
avoid unnecessary 3-channel conversions
2025-10-06 13:39:54 +02:00
Robert Sachunsky
1fa46303c0
matching deskewed text region contours with predicted: improve
...
- avoid duplicate and missing mappings by using a different approach:
instead of just minimising the center distance for the N contours
that we expect,
1. get all N:M distances
2. iterate over them from small to large
3. continue adding correspondences until both every original contour
and every deskewed contour have at least one match
4. where one original matches multiple deskewed contours,
join the latter polygons to map as single contour
5. where one deskewed contour matches multiple originals,
split the former by intersecting with each of the latter
(after bringing them into the same coordinate space),
so ultimately only the respective match gets assigned
2025-10-06 13:39:54 +02:00
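Steps 1 to 3 of the matching described in 1fa46303c0 can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code from the commit: the function name `match_contours`, the use of bare center points as input, and the choice to skip a pair when both of its sides are already matched are all assumptions. Steps 4 and 5 (joining and splitting polygons) are left out.

```python
from itertools import product
from math import dist


def match_contours(orig_centers, new_centers):
    """Greedily pair original and deskewed contours by center distance.

    Hypothetical helper: returns a sorted list of (orig_index, new_index)
    pairs in which every index on either side occurs at least once.
    """
    # 1. get all N:M distances
    pairs = [
        (dist(o, d), i, j)
        for (i, o), (j, d) in product(enumerate(orig_centers),
                                      enumerate(new_centers))
    ]
    # 2. iterate over them from small to large
    pairs.sort()
    matched_orig, matched_new, result = set(), set(), []
    for _, i, j in pairs:
        # 3. stop once every contour on both sides has at least one match
        if (len(matched_orig) == len(orig_centers)
                and len(matched_new) == len(new_centers)):
            break
        # Assumption: a pair that covers nothing new is not added.
        if i in matched_orig and j in matched_new:
            continue
        result.append((i, j))
        matched_orig.add(i)
        matched_new.add(j)
    return sorted(result)
```

With two original centers `[(0, 0), (10, 0)]` and three deskewed centers `[(0, 1), (10, 1), (11, 1)]`, the second original ends up paired with both the second and third deskewed contour, which is the situation step 4 resolves by joining polygons.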
Robert Sachunsky
2850fc6f8d
matching deskewed text region contours with predicted: improve
...
- apply same min-area filter to deskewed contours as to original ones
2025-10-06 13:39:54 +02:00
Robert Sachunsky
29fcc75c0b
matching deskewed text region contours with predicted: improve
...
- when matching undeskewed and new contours, do not just
pick the closest centers, respectively, but also of similar
size (by making the contour area the 3rd dimension of the
vector norm in the distance calculation)
2025-10-06 13:39:54 +02:00
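The idea in 29fcc75c0b, treating the contour area as a third coordinate so that the distance penalises size mismatch, can be illustrated with a small sketch. The names `feature` and `closest` are hypothetical, and using the raw area without any rescaling relative to the pixel coordinates is an assumption; the commit message does not say how the two are weighted.

```python
from math import dist


def feature(center, area):
    # Assumption: the raw area is appended unscaled as the 3rd dimension.
    return (center[0], center[1], area)


def closest(orig, candidates, use_area=True):
    """Index of the candidate (center, area) nearest to orig (center, area)."""
    def key(k):
        center, area = candidates[k]
        if use_area:
            # Euclidean norm over (x, y, area)
            return dist(feature(*orig), feature(center, area))
        # Euclidean norm over (x, y) only, for comparison
        return dist(orig[0], center)
    return min(range(len(candidates)), key=key)


# One original region and two candidates: a small fragment whose center is
# very close, and a similarly sized region whose center is slightly further.
orig = ((100, 100), 5000)
candidates = [((102, 100), 200), ((110, 105), 4900)]
```

Here `closest(orig, candidates, use_area=False)` picks the nearby fragment (index 0), whereas `closest(orig, candidates)` picks the region of similar size (index 1).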
Robert Sachunsky
56f2d4131e
matching deskewed text region contours with predicted: simplify
...
- (no need for argmax if already sorted)
2025-10-06 13:39:54 +02:00
Robert Sachunsky
04766df3d3
matching deskewed text region contours with predicted: simplify
...
- avoid loops in favour of array processing
- improve readability and identifiers
2025-10-06 13:39:54 +02:00
Robert Sachunsky
fa58653ec2
do_order_of_regions: simplify
...
- remove duplicate code via inline def for the try-catch
2025-10-06 13:39:54 +02:00
Robert Sachunsky
0b1ecc02c8
do_order_of_regions: drop redundant no/full_layout
...
(`_no_full_layout` is the same copied code as `_full_layout`;
the latter runs just the same if passed an empty list for headings)
2025-10-06 13:39:54 +02:00
Robert Sachunsky
b52ce118b8
do_order_of_regions: simplify
...
- avoid loops in favour of array processing
2025-10-06 13:39:52 +02:00
Robert Sachunsky
f5e15ed6f9
do_order_of_regions: improve box matching, simplify
...
- when searching for boxes matching contour, be more precise:
- avoid heuristic rules ("xmin + 80 within xrange") in favour
of exact criteria (contour properly contained in box)
- for fallback criterion (nearest centers), also require
proper containment of center in box
- `order_of_regions`: remove (now) unnecessary (and insufficient)
workaround for missing indexes (if boxes are not covering contours
exactly)
2025-10-06 13:38:11 +02:00
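The two criteria named in f5e15ed6f9 (proper containment of the contour in a box, with a nearest-center fallback that still requires the center to lie inside the box) might look like this in plain Python. This is a sketch under assumptions: `find_box` and `contour_bounds` are hypothetical names, boxes are taken to be `(xmin, xmax, ymin, ymax)` tuples, and the contour center is approximated by the midpoint of its bounding box rather than whatever `find_center_of_contours` computes.

```python
from math import dist


def contour_bounds(contour):
    """Bounding box (xmin, xmax, ymin, ymax) of a list of (x, y) points."""
    xs = [x for x, _ in contour]
    ys = [y for _, y in contour]
    return min(xs), max(xs), min(ys), max(ys)


def find_box(contour, boxes):
    """Return the index of the box matching the contour, or None."""
    xmin, xmax, ymin, ymax = contour_bounds(contour)
    # Exact criterion: the contour lies entirely within the box.
    for k, (bx0, bx1, by0, by1) in enumerate(boxes):
        if bx0 <= xmin and xmax <= bx1 and by0 <= ymin and ymax <= by1:
            return k
    # Fallback: nearest box center, but only among boxes that contain
    # the contour center (assumed here to be the bounding-box midpoint).
    cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
    inside = [
        k for k, (bx0, bx1, by0, by1) in enumerate(boxes)
        if bx0 <= cx <= bx1 and by0 <= cy <= by1
    ]
    if not inside:
        return None
    return min(
        inside,
        key=lambda k: dist((cx, cy),
                           ((boxes[k][0] + boxes[k][1]) / 2,
                            (boxes[k][2] + boxes[k][3]) / 2)),
    )
```

A contour that straddles two boxes is assigned to the one containing its center, and a contour whose center falls in no box yields `None` instead of a guessed match.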
Robert Sachunsky
94599b9b12
order_of_regions: simplify
...
- use new `find_center_of_contours`
- avoid unused calculations
- avoid loops in favour of array processing
2025-10-06 13:32:35 +02:00
Robert Sachunsky
8897dbe8dd
do_order_of_regions / order_of_regions: simplify
...
- array-convert only once (before returning from `order_of_regions`)
- avoid passing `matrix_of_orders` unnecessarily between
`order_of_regions` and `order_and_id_of_texts`
2025-10-06 13:32:35 +02:00
Robert Sachunsky
9a7bfd6409
eynollah, drop_capitals: simplify
...
- use new `find_center_of_contours`
2025-10-06 13:32:35 +02:00
Robert Sachunsky
a06b7da306
do_order_of_regions: simplify
...
- avoid loops in favour of array processing
2025-10-06 13:32:35 +02:00
Robert Sachunsky
9bcad6f4c4
separate_marginals_to_left_and_right_and_order_from_top_to_down: simplify
...
- use new `find_center_of_contours`
- avoid loops in favour of array processing
- avoid repeated sorting
2025-10-06 13:32:34 +02:00
Robert Sachunsky
81827c2942
filter_contours_inside_a_bigger_one: simplify
...
- use new `find_center_of_contours`
- avoid loops in favour of array processing
- use sets instead of `np.unique` and `np.delete` instead of list.pop
2025-10-06 13:32:34 +02:00
Robert Sachunsky
0b9d4901a6
contour features: avoid unused calculations, simplify, add shortcuts
...
- new function: `find_center_of_contours`
- simplified: `find_(new_)features_of_contours`
2025-10-02 20:51:03 +02:00
Robert Sachunsky
3aa7ad04fa
📝 update changelog
2025-09-30 23:14:52 +02:00
Robert Sachunsky
f0de1adabf
rm loky dependency
2025-09-30 23:12:18 +02:00
Robert Sachunsky
7daec392b9
Dockerfile: fix up CUDA installation for mixed TF/Torch
2025-09-30 22:10:45 +02:00
Robert Sachunsky
ad129ed46c
CI: remove OS from model cache keys
2025-09-30 22:05:53 +02:00
Robert Sachunsky
c86e59f481
CI: update model key, split up cache restore/save
2025-09-30 22:03:46 +02:00
Robert Sachunsky
a3d8197930
makefile: update model URL
2025-09-30 21:50:21 +02:00
Robert Sachunsky
61b20cc83d
tests: switch from subtests to parametrize, use --isolate everywhere to free CUDA memory in between
2025-09-30 19:20:35 +02:00
Robert Sachunsky
375e0263d4
CNN-RNN OCR model: switch to 20250930 version (compatible with TF 2.12 on CPU as well)
2025-09-30 19:16:50 +02:00
Robert Sachunsky
b21051db21
ProcessPoolExecutor: shutdown during del() instead of atexit()
2025-09-30 19:16:00 +02:00
Robert Sachunsky
08c8c26028
indent extremely long lines
2025-09-30 03:52:19 +02:00