Commit graph

945 commits

Author SHA1 Message Date
Robert Sachunsky
eee9c881ed fixup c0137c2 (missing arguments for utils_ocr) 2025-10-08 15:12:38 +02:00
Robert Sachunsky
9b4e835578 utils_ocr: forgot to pass coordinate offsets 2025-10-08 14:56:14 +02:00
Robert Sachunsky
303a3d484b fixup f700aaf3 2025-10-08 12:58:59 +02:00
Robert Sachunsky
ffe7a2de6b make models: avoid re-download 2025-10-08 12:33:14 +02:00
Robert Sachunsky
ee91caee4a fixup 70344c13 2025-10-08 12:23:10 +02:00
Robert Sachunsky
7ed2da966f CI: add diagnostic message for model symlink 2025-10-08 12:17:53 +02:00
Robert Sachunsky
05fb64676a fixup a388de1 2025-10-08 12:13:12 +02:00
Robert Sachunsky
e4ce4c593b fixup for e451ccd0
(`contours_only_text_parent_d_ordered` is not None
 any more, but always a list)
2025-10-08 02:08:24 +02:00
Robert Sachunsky
26266dd13b writer/run_single: consistent kwarg naming conf_contours_textregion(s) 2025-10-08 01:43:26 +02:00
Robert Sachunsky
7dd51d1b10 run_single: call writer.build_pagexml_no_full_layout w/ kwargs 2025-10-08 01:42:26 +02:00
Robert Sachunsky
cc4f263d88 tests: cover table detection in various modes 2025-10-08 01:42:26 +02:00
Robert Sachunsky
4ec7999803 writer: simplify
- `build_pagexml_no_full_layout`: delegate to
  `build_pagexml_full_layout` (removing redundant code)
2025-10-08 01:42:18 +02:00
Robert Sachunsky
0d3d476f0a writer: simplify
- simplify serialization of coordinates
- re-use `serialize_lines_in_region` (drop `*_in_dropcapital` and `*_in_marginal`)
- re-use `calculate_polygon_coords`
2025-10-07 23:03:27 +02:00
Robert Sachunsky
a388de147c get/do_work_of_slopes etc.: reduce call/return signatures
- `get_textregion_contours_in_org_image_light`: no more need
  to also return unchanged contours here (see 41cc38c5); therefore
- `txt_con_org`: no more need for this
  (now mere alias to `contours_only_text_parent`); also
- `index_by_text_par_con`: no more need for this (see prev. commit),
  so do not pass/return
- `get_slopes_and_deskew_*`: do not pass `contours_only_text`
  (where not used)
- `get_slopes_and_deskew_*`: do not return unchanged contours, boxes
- `do_work_of_slopes_*`: adapt respectively
2025-10-07 22:53:30 +02:00
Robert Sachunsky
e451ccd0a6 no more need to rm from contours_only_text_parent_d_ordered now 2025-10-07 22:47:34 +02:00
Robert Sachunsky
c770108941 filter_contours_without_textline_inside: simplify
- np.delete in index array instead of contour lists
- yield actual resulting indices
2025-10-07 22:42:36 +02:00
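The idea of deleting from an index array (rather than from the contour lists themselves) can be sketched as below. This is only an illustration, not the code from this commit: the helper name `filter_indices` and the boolean `has_textline` input are assumptions made for the example.

```python
import numpy as np

def filter_indices(contours, has_textline):
    """Return the indices of the contours that survive filtering.

    Hypothetical helper: `has_textline[i]` is assumed to say whether
    contour i contains at least one textline.
    """
    indices = np.arange(len(contours))
    # positions of contours without any textline inside
    drop = [i for i, keep in enumerate(has_textline) if not keep]
    # delete from the index array; the contour list itself stays untouched
    return np.delete(indices, drop)

kept = filter_indices(["a", "b", "c", "d"], [True, False, True, False])
```

The caller can then use the resulting indices to select from any number of parallel lists in one go.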
Robert Sachunsky
a39a9c5cc4 avoid unnecessary 3-channel conversions: for tables, too 2025-10-07 22:37:05 +02:00
Robert Sachunsky
634d2b059f do_work_of_slopes: rm unused old variant 2025-10-07 22:33:06 +02:00
Robert Sachunsky
3e7628b5cd get_text_region_boxes_b_given_contours: simplify 2025-10-07 22:32:06 +02:00
Robert Sachunsky
316d813db9 filter_contours_inside_a_bigger_one: fix edge case in 81827c29 2025-10-07 22:06:57 +02:00
Robert Sachunsky
f700aaf371 CI: run deps-test with OCR extra so symlink rule fires 2025-10-07 00:54:25 +02:00
Robert Sachunsky
59a19a169d tests: symlink OCR models into layout model directory
(so layout with OCR options works with our split model packages)
2025-10-06 21:27:21 +02:00
Robert Sachunsky
cd8e6b81eb tests: cover layout with OCR in various modes 2025-10-06 17:44:12 +02:00
Robert Sachunsky
4bb93b8f46 run_single: simplify; allow running TrOCR in non-fl mode, too
- refactor final `self.full_layout` conditional, removing copied code
- allow running `self.ocr` and `self.tr` branch in both cases (non/fl)
- when running TrOCR, use model / processor / device initialised during init
  (instead of ad-hoc loading)
2025-10-06 17:24:50 +02:00
Robert Sachunsky
4a18a486a0 textline_contours_postprocessing: do not catch arbitrary exceptions 2025-10-06 16:53:59 +02:00
Robert Sachunsky
70344c137c avoid unnecessary 3-channel conversions: missing cases 2025-10-06 16:53:06 +02:00
Robert Sachunsky
51995c9e46 avoid unnecessary 3-channel conversions 2025-10-06 13:39:54 +02:00
Robert Sachunsky
1fa46303c0 matching deskewed text region contours with predicted: improve
- avoid duplicate and missing mappings by using a different approach:
  instead of just minimising the center distance for the N contours
  that we expect,
  1. get all N:M distances
  2. iterate over them from small to large
  3. continue adding correspondences until both every original contour
     and every deskewed contour have at least one match
  4. where one original matches multiple deskewed contours,
     join the latter polygons to map as single contour
  5. where one deskewed contour matches multiple originals,
     split the former by intersecting with each of the latter
     (after bringing them into the same coordinate space),
     so ultimately only the respective match gets assigned
2025-10-06 13:39:54 +02:00
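Steps 1 to 3 of the message above can be sketched roughly as follows. This is a minimal illustration on contour centers only, not the code from this commit: the name `match_contours` and the input format (lists of `(x, y)` centers) are assumptions, skipping pairs whose both sides are already matched is an added simplification, and the joining and splitting of polygons (steps 4 and 5) is left out.

```python
import math

def match_contours(orig_centers, deskewed_centers):
    """Greedy N:M matching by center distance (hypothetical helper)."""
    # 1. get all N:M distances
    pairs = [
        (math.dist(o, d), i, j)
        for i, o in enumerate(orig_centers)
        for j, d in enumerate(deskewed_centers)
    ]
    # 2. iterate over them from small to large
    pairs.sort()
    matched_orig, matched_desk = set(), set()
    matches = []
    for _, i, j in pairs:
        # 3. stop once every contour on both sides has at least one match
        if (len(matched_orig) == len(orig_centers)
                and len(matched_desk) == len(deskewed_centers)):
            break
        # assumption: a pair is only useful if it covers an unmatched contour
        if i in matched_orig and j in matched_desk:
            continue
        matches.append((i, j))
        matched_orig.add(i)
        matched_desk.add(j)
    return matches

# one original contour (index 1) ends up with two deskewed partners
result = match_contours([(0, 0), (10, 0)], [(1, 0), (9, 0), (10, 1)])
```

In the result, an original index that appears more than once corresponds to the case of step 4, and a deskewed index that appears more than once to the case of step 5.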
Robert Sachunsky
2850fc6f8d matching deskewed text region contours with predicted: improve
- apply same min-area filter to deskewed contours as to original ones
2025-10-06 13:39:54 +02:00
Robert Sachunsky
29fcc75c0b matching deskewed text region contours with predicted: improve
- when matching undeskewed and new contours, do not just
  pick the closest centers, respectively, but also of similar
  size (by making the contour area the 3rd dimension of the
  vector norm in the distance calculation)
2025-10-06 13:39:54 +02:00
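A distance that treats contour area as a third dimension might look like the sketch below. The helper name `feature_distance` is hypothetical, and the area is used unscaled here, which is an assumption: the message does not say whether or how the area is normalised against the coordinates.

```python
import math

def feature_distance(center_a, area_a, center_b, area_b):
    """Euclidean norm over (x, y, area) differences (hypothetical helper)."""
    dx = center_a[0] - center_b[0]
    dy = center_a[1] - center_b[1]
    da = area_a - area_b  # assumption: raw area difference, no scaling
    return math.sqrt(dx * dx + dy * dy + da * da)

# a close neighbour of very different size vs. a farther one of equal size
near_but_bigger = feature_distance((0, 0), 100, (1, 0), 200)
far_but_similar = feature_distance((0, 0), 100, (5, 0), 100)
```

With this measure the farther candidate of the same size is preferred over the closer candidate of a different size.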
Robert Sachunsky
56f2d4131e matching deskewed text region contours with predicted: simplify
- (no need for argmax if already sorted)
2025-10-06 13:39:54 +02:00
Robert Sachunsky
04766df3d3 matching deskewed text region contours with predicted: simplify
- avoid loops in favour of array processing
- improve readability and identifiers
2025-10-06 13:39:54 +02:00
Robert Sachunsky
fa58653ec2 do_order_of_regions: simplify
- remove duplicate code via inline def for the try-catch
2025-10-06 13:39:54 +02:00
Robert Sachunsky
0b1ecc02c8 do_order_of_regions: drop redundant no/full_layout
(`_no_full_layout` is the same copied code as `_full_layout`;
 the latter runs just the same if passed an empty list for headings)
2025-10-06 13:39:54 +02:00
Robert Sachunsky
b52ce118b8 do_order_of_regions: simplify
- avoid loops in favour of array processing
2025-10-06 13:39:52 +02:00
Robert Sachunsky
f5e15ed6f9 do_order_of_regions: improve box matching, simplify
- when searching for boxes matching contour, be more precise:
  - avoid heuristic rules ("xmin + 80 within xrange") in favour
    of exact criteria (contour properly contained in box)
  - for fallback criterion (nearest centers), also require
    proper containment of center in box
- `order_of_regions`: remove (now) unnecessary (and insufficient)
  workaround for missing indexes (if boxes are not covering contours
  exactly)
2025-10-06 13:38:11 +02:00
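The two criteria described above (containment first, nearest center as fallback) can be sketched like this. It is not the code from this commit: the name `find_box`, the box layout `(xmin, xmax, ymin, ymax)`, the inclusive bounds and the use of the vertex mean as contour center are all assumptions made for the example.

```python
import math

def find_box(contour, boxes):
    """Return the index of the box matching the contour, or None.

    Hypothetical helper: `contour` is a list of (x, y) points,
    `boxes` a list of (xmin, xmax, ymin, ymax) tuples.
    """
    xs = [x for x, _ in contour]
    ys = [y for _, y in contour]
    # exact criterion: the whole contour lies within the box
    for k, (xmin, xmax, ymin, ymax) in enumerate(boxes):
        if xmin <= min(xs) and max(xs) <= xmax and ymin <= min(ys) and max(ys) <= ymax:
            return k
    # fallback: nearest box center, but only among boxes containing the contour center
    cx = sum(xs) / len(xs)
    cy = sum(ys) / len(ys)
    best, best_dist = None, math.inf
    for k, (xmin, xmax, ymin, ymax) in enumerate(boxes):
        if not (xmin <= cx <= xmax and ymin <= cy <= ymax):
            continue
        dist = math.hypot(cx - (xmin + xmax) / 2, cy - (ymin + ymax) / 2)
        if dist < best_dist:
            best, best_dist = k, dist
    return best

boxes = [(0, 10, 0, 10), (10, 30, 0, 10)]
inside = find_box([(1, 1), (4, 1), (4, 4), (1, 4)], boxes)
straddling = find_box([(8, 2), (20, 2), (20, 6), (8, 6)], boxes)
outside = find_box([(40, 40), (50, 40), (50, 50)], boxes)
```

A contour that fits no box and whose center lies in no box yields `None`, so the caller has to handle that case explicitly.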
Robert Sachunsky
94599b9b12 order_of_regions: simplify
- use new `find_center_of_contours`
- avoid unused calculations
- avoid loops in favour of array processing
2025-10-06 13:32:35 +02:00
Robert Sachunsky
8897dbe8dd do_order_of_regions / order_of_regions: simplify
- array-convert only once (before returning from `order_of_regions`)
- avoid passing `matrix_of_orders` unnecessarily between
  `order_of_regions` and `order_and_id_of_texts`
2025-10-06 13:32:35 +02:00
Robert Sachunsky
9a7bfd6409 eynollah, drop_capitals: simplify
- use new `find_center_of_contours`
2025-10-06 13:32:35 +02:00
Robert Sachunsky
a06b7da306 do_order_of_regions: simplify
- avoid loops in favour of array processing
2025-10-06 13:32:35 +02:00
Robert Sachunsky
9bcad6f4c4 separate_marginals_to_left_and_right_and_order_from_top_to_down: simplify
- use new `find_center_of_contours`
- avoid loops in favour of array processing
- avoid repeated sorting
2025-10-06 13:32:34 +02:00
Robert Sachunsky
81827c2942 filter_contours_inside_a_bigger_one: simplify
- use new `find_center_of_contours`
- avoid loops in favour of array processing
- use sets instead of `np.unique` and `np.delete` instead of list.pop
2025-10-06 13:32:34 +02:00
Robert Sachunsky
0b9d4901a6 contour features: avoid unused calculations, simplify, add shortcuts
- new function: `find_center_of_contours`
- simplified: `find_(new_)features_of_contours`
2025-10-02 20:51:03 +02:00
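For readers unfamiliar with the notion, a contour center can be computed as the polygon centroid. The sketch below is a pure-Python stand-in using the shoelace formula; the names `polygon_center` and `polygon_centers` are hypothetical, and how `find_center_of_contours` itself computes the centers is not shown in this log.

```python
def polygon_center(points):
    """Centroid of a closed polygon given as a list of (x, y) vertices."""
    n = len(points)
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    for k in range(n):
        x0, y0 = points[k]
        x1, y1 = points[(k + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        # degenerate polygon (e.g. a line): fall back to the vertex mean
        return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)
    return (cx / (3 * area2), cy / (3 * area2))

def polygon_centers(contours):
    """Centers for a list of contours (hypothetical batch helper)."""
    return [polygon_center(c) for c in contours]

centers = polygon_centers([
    [(0, 0), (4, 0), (4, 2), (0, 2)],
    [(0, 0), (2, 0)],
])
```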
Robert Sachunsky
3aa7ad04fa 📝 update changelog 2025-09-30 23:14:52 +02:00
Robert Sachunsky
f0de1adabf rm loky dependency 2025-09-30 23:12:18 +02:00
Robert Sachunsky
7daec392b9 Dockerfile: fix up CUDA installation for mixed TF/Torch 2025-09-30 22:10:45 +02:00
Robert Sachunsky
ad129ed46c CI: remove OS from model cache keys 2025-09-30 22:05:53 +02:00
Robert Sachunsky
c86e59f481 CI: update model key, split up cache restore/save 2025-09-30 22:03:46 +02:00
Robert Sachunsky
a3d8197930 makefile: update model URL 2025-09-30 21:50:21 +02:00
Robert Sachunsky
61b20cc83d tests: switch from subtests to parametrize, use --isolate everywhere to free CUDA memory in between 2025-09-30 19:20:35 +02:00