Robert Sachunsky
8299e7009a
`setup_models`: avoid unnecessarily loading region_fl
2025-10-14 14:27:32 +02:00
Robert Sachunsky
e8b7212f36
`polygon2contour`: avoid uint for coords
...
(introduced in a433c736 to make consistent with
`filter_contours_area_of_image`, but actually
np.uint is prone to create overflows downstream)
2025-10-14 14:27:26 +02:00
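The overflow risk named in the commit above can be reproduced in a few lines. This is a minimal sketch, not code from the repository: the coordinate values and the `offset` are made up for illustration, and fixed-width `np.uint32` / `np.int32` stand in for `np.uint` so that the result does not depend on the platform.

```python
import numpy as np

# Hypothetical polygon coordinates (x, y), stored as unsigned integers.
coords_unsigned = np.array([[5, 10], [20, 10], [20, 30]], dtype=np.uint32)
# The same coordinates stored as signed integers.
coords_signed = coords_unsigned.astype(np.int32)

# Hypothetical offset, e.g. the top-left corner of a crop.
offset = np.array([10, 0])

# Unsigned array arithmetic wraps around silently when the result is negative.
shifted_unsigned = coords_unsigned - offset.astype(np.uint32)
# Signed arithmetic yields the expected negative coordinate.
shifted_signed = coords_signed - offset.astype(np.int32)

print(shifted_unsigned[0, 0])  # a huge positive number instead of -5
print(shifted_signed[0, 0])
```

Any later step that subtracts offsets or computes differences on such coordinates inherits the wrapped values, which is why signed coordinates are the safer default.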
kba
2056a8bdb9
📦 v0.6.0rc1
2025-10-10 16:32:47 +02:00
Robert Sachunsky
4e9a1618c3
layout: refactor model setup, allow loading custom versions
...
- simplify definition of (defaults for) model versions
- unify loading of loadable models (depending on mode)
- use `self.models` dict instead of `self.model_*` attributes
- add `model_versions` kwarg / `--model_version` CLI option
2025-10-10 03:18:09 +02:00
Robert Sachunsky
374818de11
📝 update changelog for 5725e4f
2025-10-09 23:11:05 +02:00
Robert Sachunsky
c4cb16c2a8
simplify
...
(`skip_layout_and_reading_order` is already an attr)
2025-10-09 23:05:50 +02:00
Robert Sachunsky
ecb53056f2
Merge branch 'main' of https://github.com/qurator-spk/eynollah into loky-with-shm-for-175-rebuilt
2025-10-09 22:54:11 +02:00
Robert Sachunsky
d96af425a7
Merge pull request #4 from bertsky/loky-with-shm-for-175-rebuilt-refactored
...
refactoring for 192: speedup and improvements
2025-10-09 22:18:53 +02:00
Robert Sachunsky
cab392601e
📝 update changelog
2025-10-09 20:14:11 +02:00
Robert Sachunsky
e1b56d97da
CI: lint with ruff
2025-10-09 20:14:11 +02:00
Robert Sachunsky
a144026b27
add rough ruff config
2025-10-09 20:14:11 +02:00
Robert Sachunsky
b3d29bef89
return_contours_of_interested_region*: rm unused variants
2025-10-09 20:14:11 +02:00
Robert Sachunsky
8a2d682e12
fix identifier scope in layout OCR options (w/o full_layout)
2025-10-09 20:14:11 +02:00
Robert Sachunsky
096def1e9d
mbreorder/enhancement: fix missing imports
...
(not sure if these models really need that, though)
2025-10-09 20:14:11 +02:00
Robert Sachunsky
027b87d321
fixup c0137c2 (missing arguments for utils_ocr)
2025-10-09 20:14:11 +02:00
Robert Sachunsky
1d4815b48f
utils_ocr: forgot to pass coordinate offsets
2025-10-09 20:14:11 +02:00
Robert Sachunsky
839b7c4d84
make models: avoid re-download
2025-10-09 20:14:11 +02:00
Robert Sachunsky
e5b5264568
CI: add diagnostic message for model symlink
2025-10-09 20:14:11 +02:00
Robert Sachunsky
ca72a095ca
tests: cover table detection in various modes
2025-10-09 20:14:11 +02:00
Robert Sachunsky
5e11a68a3e
writer/run_single: consistent kwarg naming conf_contours_textregion(s)
2025-10-09 20:14:11 +02:00
Robert Sachunsky
75823f9bed
run_single: call `writer.build_pagexml_no_full_layout` w/ kwargs
2025-10-09 20:14:11 +02:00
Robert Sachunsky
cbbb3248c7
writer: simplify
...
- `build_pagexml_no_full_layout`: delegate to
`build_pagexml_full_layout` (removing redundant code)
2025-10-09 20:14:11 +02:00
Robert Sachunsky
e32479765c
writer: simplify
...
- simplify serialization of coordinates
- re-use `serialize_lines_in_region` (drop `*_in_dropcapital` and `*_in_marginal`)
- re-use `calculate_polygon_coords`
2025-10-09 20:14:11 +02:00
Robert Sachunsky
d88ca18eec
get/do_work_of_slopes etc.: reduce call/return signatures
...
- `get_textregion_contours_in_org_image_light`: no more need
to also return unchanged contours here (see 41cc38c5); therefore
- `txt_con_org`: no more need for this
(now mere alias to `contours_only_text_parent`); also
- `index_by_text_par_con`: no more need for this (see prev. commit),
so do not pass/return
- `get_slopes_and_deskew_*`: do not pass `contours_only_text`
(where not used)
- `get_slopes_and_deskew_*`: do not return unchanged contours, boxes
- `do_work_of_slopes_*`: adapt respectively
2025-10-09 20:14:11 +02:00
Robert Sachunsky
02a347a48a
no more need to rm from `contours_only_text_parent_d_ordered` now
2025-10-09 20:14:11 +02:00
Robert Sachunsky
fd43e78442
filter_contours_without_textline_inside: simplify
...
- np.delete in index array instead of contour lists
- yield actual resulting indices
2025-10-09 20:14:11 +02:00
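The idea in the commit above (delete from an index array rather than from each contour list, then report the surviving indices) might look roughly like this. It is a sketch under assumptions: `drop_regions_without_textlines` and its arguments are hypothetical names, not the signature of `filter_contours_without_textline_inside`.

```python
import numpy as np

def drop_regions_without_textlines(contours, textlines_per_region):
    """Return the kept contours plus the original indices that survived."""
    indices = np.arange(len(contours))
    # Positions of regions that contain no text line at all.
    empty = np.array(
        [i for i, lines in enumerate(textlines_per_region) if len(lines) == 0],
        dtype=int,
    )
    # Delete once from the index array instead of from every parallel list.
    kept = np.delete(indices, empty)
    return [contours[i] for i in kept], kept

# Toy data: strings stand in for contour arrays.
contours = ["a", "b", "c", "d"]
textlines = [[1], [], [2, 3], []]
kept_contours, kept_indices = drop_regions_without_textlines(contours, textlines)
```

Because the surviving indices are returned, any other list that runs parallel to the contours can be filtered with the same `kept_indices` afterwards.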
Robert Sachunsky
0a80cd5dff
avoid unnecessary 3-channel conversions: for tables, too
2025-10-09 20:14:11 +02:00
Robert Sachunsky
dfdc705375
do_work_of_slopes: rm unused old variant
2025-10-09 20:14:11 +02:00
Robert Sachunsky
2e907875c1
get_text_region_boxes_by_given_contours: simplify
2025-10-09 20:14:11 +02:00
Robert Sachunsky
d53f829dfd
filter_contours_inside_a_bigger_one: fix edge case in 81827c29
2025-10-09 20:14:11 +02:00
Robert Sachunsky
18bbdb7c48
CI: run deps-test with OCR extra so symlink rule fires
2025-10-09 20:14:11 +02:00
Robert Sachunsky
23535998f7
tests: symlink OCR models into layout model directory
...
(so layout with OCR options works with our split model packages)
2025-10-09 20:14:11 +02:00
Robert Sachunsky
a1904fa660
tests: cover layout with OCR in various modes
2025-10-09 20:14:11 +02:00
Robert Sachunsky
595ed02743
run_single: simplify; allow running TrOCR in non-fl mode, too
...
- refactor final `self.full_layout` conditional, removing copied code
- allow running `self.ocr` and `self.tr` branch in both cases (non/fl)
- when running TrOCR, use model / processor / device initialised during init
(instead of ad-hoc loading)
2025-10-09 20:14:11 +02:00
Robert Sachunsky
6e57ab3741
textline_contours_postprocessing: do not catch arbitrary exceptions
2025-10-09 20:14:11 +02:00
Robert Sachunsky
fe603188f4
avoid unnecessary 3-channel conversions
2025-10-09 20:14:11 +02:00
Robert Sachunsky
155b8f68b8
matching deskewed text region contours with predicted: improve
...
- avoid duplicate and missing mappings by using a different approach:
instead of just minimising the center distance for the N contours
that we expect,
1. get all N:M distances
2. iterate over them from small to large
3. continue adding correspondences until both every original contour
and every deskewed contour have at least one match
4. where one original matches multiple deskewed contours,
join the latter polygons to map as single contour
5. where one deskewed contour matches multiple originals,
split the former by intersecting with each of the latter
(after bringing them into the same coordinate space),
so ultimately only the respective match gets assigned
2025-10-09 20:14:11 +02:00
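Steps 1 to 3 of the commit above can be sketched as a greedy pass over all pairwise distances. This is an illustration only: `greedy_correspondences` is a hypothetical name, plain center distance is used as the metric, and skipping pairs whose two sides are both already matched is an assumption the commit message does not spell out. Steps 4 and 5 (joining and splitting polygons) are left out.

```python
import numpy as np

def greedy_correspondences(centers_orig, centers_new):
    """Pair contours until every original and every new one has a match."""
    centers_orig = np.asarray(centers_orig, dtype=float)
    centers_new = np.asarray(centers_new, dtype=float)
    # Step 1: all N:M distances between centers.
    dists = np.linalg.norm(centers_orig[:, None] - centers_new[None], axis=2)
    n, m = dists.shape
    pairs, seen_orig, seen_new = [], set(), set()
    # Step 2: iterate over the distances from small to large.
    for flat in np.argsort(dists, axis=None):
        i, j = (int(k) for k in np.unravel_index(flat, dists.shape))
        if i in seen_orig and j in seen_new:
            continue  # assumption: such a pair adds no new coverage
        pairs.append((i, j))
        seen_orig.add(i)
        seen_new.add(j)
        # Step 3: stop once both sides are fully covered.
        if len(seen_orig) == n and len(seen_new) == m:
            break
    return pairs

pairs = greedy_correspondences([[0, 0], [10, 0]],
                               [[1, 0], [8.5, 0], [10.5, 0]])
```

In this toy input, original contour 1 ends up paired with two new contours (1 and 2), which is the situation step 4 resolves by joining the matched polygons.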
Robert Sachunsky
0e00d7868b
matching deskewed text region contours with predicted: improve
...
- apply same min-area filter to deskewed contours as to original ones
2025-10-09 20:14:11 +02:00
Robert Sachunsky
0f33c21eb3
matching deskewed text region contours with predicted: improve
...
- when matching undeskewed and new contours, do not just
pick the closest centers, respectively, but also of similar
size (by making the contour area the 3rd dimension of the
vector norm in the distance calculation)
2025-10-09 20:14:11 +02:00
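Treating size as a third coordinate, as the commit above describes, can be illustrated as follows. The function name `match_by_center_and_size` is hypothetical, and taking the square root of the area (so that it is in pixel units like the center coordinates) is an assumption made for this sketch; the commit message only says that the area enters the norm.

```python
import numpy as np

def match_by_center_and_size(centers_a, areas_a, centers_b, areas_b):
    """For each contour in A, return the index of the closest contour in B."""
    # Build (x, y, size) feature vectors; sqrt(area) is an assumed scaling.
    feats_a = np.column_stack([centers_a, np.sqrt(areas_a)])
    feats_b = np.column_stack([centers_b, np.sqrt(areas_b)])
    # Euclidean norm over all three dimensions, for every A/B pair.
    dists = np.linalg.norm(feats_a[:, None] - feats_b[None], axis=2)
    return dists.argmin(axis=1)

# Toy data: a small and a large region with nearby centers.
centers_a = np.array([[10.0, 10.0], [10.0, 12.0]])
areas_a = np.array([100.0, 10000.0])
centers_b = np.array([[10.0, 10.5], [10.0, 11.5]])
areas_b = np.array([9800.0, 110.0])

with_size = match_by_center_and_size(centers_a, areas_a, centers_b, areas_b)
# Passing equal areas reduces the metric to plain center distance.
centers_only = match_by_center_and_size(centers_a, np.ones(2), centers_b, np.ones(2))
```

With centers alone, each region is paired with whichever contour happens to lie nearest; adding the size dimension pairs the small region with the small contour and the large with the large.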
Robert Sachunsky
73e5a1def8
matching deskewed text region contours with predicted: simplify
...
- (no need for argmax if already sorted)
2025-10-09 20:14:11 +02:00
Robert Sachunsky
d774a23daa
matching deskewed text region contours with predicted: simplify
...
- avoid loops in favour of array processing
- improve readability and identifiers
2025-10-09 20:14:11 +02:00
Robert Sachunsky
29b4527bde
do_order_of_regions: simplify
...
- remove duplicate code via inline def for the try-catch
2025-10-09 20:14:11 +02:00
Robert Sachunsky
e674ea08f3
do_order_of_regions: drop redundant no/full_layout
...
(`_no_full_layout` is the same copied code as `_full_layout`;
the latter runs just the same if passed an empty list for headings)
2025-10-09 20:14:11 +02:00
Robert Sachunsky
e9bb62bd86
do_order_of_regions: simplify
...
- avoid loops in favour of array processing
2025-10-09 20:14:11 +02:00
Robert Sachunsky
7387f5a929
do_order_of_regions: improve box matching, simplify
...
- when searching for boxes matching contour, be more precise:
- avoid heuristic rules ("xmin + 80 within xrange") in favour
of exact criteria (contour properly contained in box)
- for fallback criterion (nearest centers), also require
proper containment of center in box
- `order_of_regions`: remove (now) unnecessary (and insufficient)
workaround for missing indexes (if boxes are not covering contours
exactly)
2025-10-09 20:14:11 +02:00
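The two criteria from the commit above (exact containment first, then nearest center with the center required to lie inside the box) could be expressed like this. It is a sketch under assumptions: `find_box_for_contour` is a hypothetical helper, boxes are taken as rows of `(xmin, xmax, ymin, ymax)`, contours as `(N, 2)` point arrays, and the contour center is approximated by the mean of its points.

```python
import numpy as np

def find_box_for_contour(contour, boxes):
    """Return the index of the box matching the contour, or None."""
    contour = np.asarray(contour, dtype=float)
    boxes = np.asarray(boxes, dtype=float)
    xs, ys = contour[:, 0], contour[:, 1]
    # Exact criterion: the contour lies completely inside the box.
    contained = ((boxes[:, 0] <= xs.min()) & (xs.max() <= boxes[:, 1]) &
                 (boxes[:, 2] <= ys.min()) & (ys.max() <= boxes[:, 3]))
    if contained.any():
        return int(np.flatnonzero(contained)[0])
    # Fallback: nearest box center, but only among boxes containing
    # the contour center (here: mean of the points, an assumption).
    cx, cy = xs.mean(), ys.mean()
    has_center = ((boxes[:, 0] <= cx) & (cx <= boxes[:, 1]) &
                  (boxes[:, 2] <= cy) & (cy <= boxes[:, 3]))
    if not has_center.any():
        return None
    box_centers = np.column_stack([(boxes[:, 0] + boxes[:, 1]) / 2,
                                   (boxes[:, 2] + boxes[:, 3]) / 2])
    dists = np.linalg.norm(box_centers - np.array([cx, cy]), axis=1)
    dists[~has_center] = np.inf
    return int(dists.argmin())

boxes = [[0, 100, 0, 50], [0, 100, 50, 100]]
inside = find_box_for_contour([[10, 60], [40, 60], [40, 90], [10, 90]], boxes)
straddling = find_box_for_contour([[10, 40], [40, 40], [40, 70], [10, 70]], boxes)
outside = find_box_for_contour([[200, 200], [210, 200], [210, 210]], boxes)
```

Returning `None` when neither criterion holds makes unmatched contours explicit instead of silently assigning them to an arbitrary box.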
Robert Sachunsky
4950e6bd78
order_of_regions: simplify
...
- use new `find_center_of_contours`
- avoid unused calculations
- avoid loops in favour of array processing
2025-10-09 20:14:10 +02:00
Robert Sachunsky
a1c8fd4467
do_order_of_regions / order_of_regions: simplify
...
- array-convert only once (before returning from `order_of_regions`)
- avoid passing `matrix_of_orders` unnecessarily between
`order_of_regions` and `order_and_id_of_texts`
2025-10-09 20:14:10 +02:00
Robert Sachunsky
415b2cbad8
eynollah, drop_capitals: simplify
...
- use new `find_center_of_contours`
2025-10-09 20:14:10 +02:00
Robert Sachunsky
3f3353ec3a
do_order_of_regions: simplify
...
- avoid loops in favour of array processing
2025-10-09 20:14:10 +02:00
Robert Sachunsky
8c3d5eb0eb
separate_marginals_to_left_and_right_and_order_from_top_to_down: simplify
...
- use new `find_center_of_contours`
- avoid loops in favour of array processing
- avoid repeated sorting
2025-10-09 20:14:10 +02:00