1514 commits

Author SHA1 Message Date
Robert Sachunsky
db87aa995d reqs for OCR: relax ad5f2272 (depending on Python version) 2026-05-11 03:15:54 +02:00
Robert Sachunsky
e183937c5d separate_lines_new2: fix coord overflow by clipping, simplify…
- found positive and negative peaks, and even more so their
  relative offsets, may overflow in the cropped image,
  causing fake textlines; avoid that by clipping to the valid
  y coordinates
- calculation for number of tiles: sometimes one less
  tile is needed by making the previous last tile
  half-full on the right side
- add some (commented) plotting
- simplify (a lot, but only partially)
2026-05-11 03:09:02 +02:00
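
The clipping idea in e183937c5d can be illustrated with a minimal sketch. It is not the code from `separate_lines_new2`: the function name, its parameters and the fixed up/down offsets are hypothetical, and only the principle (keep every derived y coordinate inside the cropped image) is taken from the commit message.

```python
def clip_line_bounds(peaks, offset_up, offset_down, height):
    """Return (top, bottom) y pairs for each peak, clipped to [0, height - 1].

    All names are hypothetical; `height` is the row count of the cropped image.
    """
    last = height - 1
    bounds = []
    for y in peaks:
        # Without clipping, peak +/- offset may leave the image and
        # later be mistaken for an extra textline.
        top = min(max(y - offset_up, 0), last)
        bottom = min(max(y + offset_down, 0), last)
        bounds.append((top, bottom))
    return bounds
```

For example, with `height=100` and offsets of 10, a peak at y=3 gets its top bound clipped to 0 and a peak at y=98 gets its bottom bound clipped to 99.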
Robert Sachunsky
130f0aee42 do_work_of_slopes_curved: improve on d257869d
- relative images now need larger relative min_area
  (i.e. compensation factors)
- do not attempt (even) single-line skew estimation
  (via linear regression) if there is no (large enough)
  contour at all
- avoid re-computing `mask_parent`
- add some (commented) plotting
2026-05-11 03:03:04 +02:00
Robert Sachunsky
a61fb09ec5 CI: drop py3.8 (u/a for new req transformers >= 5) 2026-05-09 04:14:49 +02:00
Robert Sachunsky
4406a0299e update CLI test for binarization…
- update expected log messages
2026-05-09 04:12:19 +02:00
Robert Sachunsky
4cd398bd0d standalone binarization: update, simplify…
- re-use Eynollah base class, drop copied code
- simplify `run()` and `run_single()`
- delegate to `do_prediction()`
  instead of custom (old) tiling loop
- drop `predict()`
- add `--device` option to CLI as well
2026-05-09 04:12:02 +02:00
Robert Sachunsky
29abae0144 update CLI test for enhancer…
- update expected log messages
- force `-ncu 3`, because otherwise
  the example images would not be deemed
  in need of enhancement
2026-05-09 02:59:52 +02:00
Robert Sachunsky
c1b6a61301 standalone enhancer: make this work (at all)…
- re-use Eynollah base class, drop copied code
- write usable `run()` and `run_single()`
- delegate to `resize_image_with_column_classifier()`
  for column classifier, resizing and enhancement,
  instead of `resize_and_enhance_image_with_column_classifier()`
  (which does _not_ actually enhance)
- drop unused `predict_enhancement()`
- add defaults to `num_col` options (always numeric)
- add `--device` option to CLI as well
2026-05-09 02:55:01 +02:00
Robert Sachunsky
d63ce5538c resize_image_with_column_classifier(): apply num_col bounds here too
use rules from `resize_and_enhance_image_with_column_classifier()`
and apply them to `resize_image_with_column_classifier()` as well

(to be used by enhancer CLI)
2026-05-09 02:53:04 +02:00
Robert Sachunsky
6df2144c0f fix 2 typos in previous commits…
- becf031c65
- cefe596f8b
2026-05-09 02:31:22 +02:00
Robert Sachunsky
daf0c90d6e
Merge pull request #8 from bertsky/ro-fixes-training-reload
training: reload models
2026-05-08 18:46:43 +02:00
Robert Sachunsky
395decd6d6
Merge pull request #7 from qurator-spk/ro-fixes-training-reload-additions
Ro fixes training reload additions
2026-05-08 18:45:28 +02:00
Robert Sachunsky
3a9d72d3fc
Merge pull request #6 from qurator-spk/update-cd
Deploy versioned docker images and update transformers
2026-05-08 18:44:49 +02:00
Robert Sachunsky
ea8f985ff1 apply cropping only after textline and early layout…
(because old models seem to fare better that way,
 despite training documentation)
2026-05-08 18:41:47 +02:00
Robert Sachunsky
58afdf5e87 do_prediction*(): ensure always returns dtype=uint8 2026-05-08 17:36:31 +02:00
Robert Sachunsky
68a26a5c3f do_prediction*(): smooth window transitions with sigmoid…
instead of hard cut-offs between overlapping window tiles,
apply sigmoid attenuation to slide from one to the next

(apply all postprocessing in the end)
2026-05-08 05:18:00 +02:00
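
As a rough illustration of 68a26a5c3f, the sketch below blends two overlapping 1D rows of predictions with a sigmoid ramp instead of a hard cut-off. It is a simplified stand-in, not the `do_prediction*()` code: the function names, the `steepness` parameter and the restriction to plain lists in one dimension are all assumptions.

```python
import math

def sigmoid_ramp(overlap, steepness=6.0):
    """Weights rising smoothly from near 0 to near 1 over `overlap` samples."""
    if overlap < 2:
        raise ValueError("overlap must be at least 2")
    weights = []
    for i in range(overlap):
        x = -1.0 + 2.0 * i / (overlap - 1)  # spread samples over [-1, 1]
        weights.append(1.0 / (1.0 + math.exp(-steepness * x)))
    return weights

def blend_rows(left, right, overlap):
    """Join two rows whose last/first `overlap` samples cover the same pixels."""
    ramp = sigmoid_ramp(overlap)
    # Inside the overlap, slide from the left tile's values to the right tile's.
    mixed = [(1.0 - w) * a + w * b
             for w, a, b in zip(ramp, left[-overlap:], right[:overlap])]
    return left[:-overlap] + mixed + right[overlap:]
```

Because the two weights always sum to 1, regions where both tiles agree are left unchanged, and only disagreements are smoothed across the overlap.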
Robert Sachunsky
cefe596f8b do_prediction*(): avoid unnecessary tiles, simplify…
- calculation for number of tiles: sometimes one less
  tile is needed by making the previous last tile
  half-full on the right side
- calculation of window margins: fix case if dimension
  extends to full image shape
- simplify (identifiers, slicing etc)
2026-05-08 00:55:18 +02:00
kba
a0bf1b51f4 makefile to reload models 2026-05-07 19:30:29 +02:00
kba
34a9d458ce training deps: use sacred fork w/o pkg_resources, pin tf/tf_keras, protobuf packages to work with tensorflow_addons 2026-05-07 18:09:27 +02:00
kba
2747385f89 remove unused deprecation-warning-causing biopython dependency 2026-05-07 17:15:15 +02:00
Robert Sachunsky
d8c83d6137 make_valid(): avoid oversimplification, improve parameter search 2026-05-05 15:00:16 +02:00
Robert Sachunsky
45868e99cd get_slopes_and_deskew_new_light2: ignore tiny contour areas 2026-05-04 15:55:00 +02:00
Robert Sachunsky
934ac90e92 get_slopes_and_deskew_new_light2: avoid +/- 90° cancellation…
in `estimate_skew_contours()`, distinguish between angle stats
scattering around <45° vs >45°: in the latter case, use modulo
180° for averages - to avoid cancelling out +90° with -90°
2026-05-04 15:52:07 +02:00
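
The cancellation problem addressed by 934ac90e92 can be shown with a small sketch. It is not the code of `estimate_skew_contours()`: the function name is hypothetical, and using the median absolute angle to decide between the <45° and >45° cases is an assumption made for illustration.

```python
from statistics import mean, median

def mean_skew(angles_deg):
    """Average skew angles (degrees) without cancelling +90 against -90."""
    angles = list(angles_deg)
    if median([abs(a) for a in angles]) > 45.0:
        # Near-vertical case: map into [0, 180), so that -89 becomes 91
        # and ends up next to +89 instead of opposite to it.
        avg = mean([a % 180.0 for a in angles])
        # Map the result back into (-90, 90].
        return avg - 180.0 if avg > 90.0 else avg
    # Near-horizontal case: a plain average is fine.
    return mean(angles)
```

With a plain average, +89° and -89° would yield 0°, although both describe almost the same near-vertical orientation. Taking the angles modulo 180° first yields 90° instead.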
Robert Sachunsky
29bb55ceff return_deskew_slop: no >90° search unless for full page, simplify 2026-05-01 00:27:00 +02:00
Robert Sachunsky
d7a3f4cec6 training: add cfg param reload_weights for building but loading…
- introduce `config_params` key `reload_weights`
- add respective section for all model types:
  - build fresh model from code
  - load existing weights from `dir_of_start_model`
  - save to `dir_output` under same basename as existing model
    (but without optimizer and metrics; which does not work currently)
  - exit immediately (i.e. no actual training)
- reorder so reload_weights is after compilation but before data loading
2026-04-30 16:54:26 +02:00
Robert Sachunsky
cbb3be0e01 add diagnostic plotting for prediction masking (commented) 2026-04-30 16:12:00 +02:00
Robert Sachunsky
33c055389d bold run_single refactoring (predict segmentation on cropped img)…
- move `extract_page()` to the start (right after enhancement),
  so early layout and textline model prediction sees cropped
  image
- `extract_page()`: also return page mask
- `get_early_layout()`:
  * use cropped image
  * also run optional table prediction here,
    map table label and confidence already
    (so no need to pass these arrays everywhere)
  * suppress all non-text type regions in textline mask
  * also return text+table mask
    (so no need to reconstruct it everywhere)
- apply page mask to textline mask and early layout result
  (i.e. suppress areas beyond border contour)
- `run_graphics_and_columns()`:
  * rename → `run_columns()`
  * no table prediction here
  * no page extraction here
  * no page cropping+masking here
  * no textline mask suppression here
- `run_graphics_and_columns_without_layout()`: drop
  (not needed anymore)
- `run_marginals()` vs. `get_marginals()`: extract
  `text_mask` internally from early layout
- early page cropping for col-classifier:
  also use cropped image in input binarization mode
- early page cropping for col-classifier:
  get external contours instead of indiscriminate tree
- writer: skip layout mode now also uses cropped coordinates
  (so drop kwarg for it)
2026-04-30 16:12:00 +02:00
Robert Sachunsky
7e7cc6a801 do_order_of_regions(): use region mask instead of textline mask…
for local (within-box) ordering of region contours, use the same
text mask (merely eroded) as for the contour extraction itself:
the text+table+drop mask from early+full layout prediction,
rather than the textline mask, because the latter may be empty
in some boxes and is unlikely to be more useful than the region
mask itself
2026-04-30 16:11:59 +02:00
Robert Sachunsky
63df9be4db find_number_of_columns_in_document(): pass in (reuse) masks 2026-04-30 16:11:59 +02:00
Robert Sachunsky
da9e00cfe5 consistently handle textline mask with respect to drop-capital mask…
- suppress drop-capital in textline mask for textline contours
- elevate drop-capital in textline mask for reading order boxes
2026-04-30 16:11:59 +02:00
Robert Sachunsky
2641171fb1 return_boxes_...order_of_reading...: avoid negative slices…
fix rare bug when horizontal separators are detected
by the very top (of a major vertical part of the page),
causing box intervals to become negative
2026-04-30 16:11:59 +02:00
Robert Sachunsky
6a92f0d49c make get_deskewed_masks() unconditional, call only when needed 2026-04-30 16:11:59 +02:00
Robert Sachunsky
52eb4c9a0a move label definition and deskewing cancellation up 2026-04-30 16:11:59 +02:00
Robert Sachunsky
fa882e1dbe move run_boxes_order() call to RO section of run_single() 2026-04-30 16:11:59 +02:00
Robert Sachunsky
d88bd485ff get_slopes*(): does not need passing boxes separately 2026-04-30 16:11:59 +02:00
Robert Sachunsky
869646cbf5 get_full_layout() does not need the textline mask 2026-04-30 16:11:59 +02:00
Robert Sachunsky
b5bc161a4c extract_page(): get external contours instead of indiscriminate tree 2026-04-30 16:11:59 +02:00
Robert Sachunsky
287bebde0d get_marginals(): fix height factor for mask resizing 2026-04-30 16:11:59 +02:00
Robert Sachunsky
a031d590b8 get_marginals(): do allow both left and right point (f/u 4bdea39)…
(as there are valid cases where both left and right marginalia
 are present) follow-up 4bdea39 by re-allowing left point _and_
right point - but still score-based, and not if very asymmetric
2026-04-30 16:11:59 +02:00
Robert Sachunsky
9571ce3474 get_marginals(): reduce indentation 2026-04-30 16:11:52 +02:00
Robert Sachunsky
c18deb0722 drop relabelling all marginalia to main if no main (now unnecessary) 2026-04-30 16:09:03 +02:00
Robert Sachunsky
1f6db34adf run/get_marginals(): simplify and speed up…
- `get_marginals` modifies region labels in-place anyways,
  so no need for retval
- de/rotate only inside `get_marginals` (for consistency)
- return early if no marginals detected
- `run_marginals`: only useful in 1 or 2 columns, so keep to
  that conditional branch; allows avoiding unnecessary resizing
  of images to and fro
- rename `text_regions_p_1` → `text_regions_p`
2026-04-30 16:09:03 +02:00
Robert Sachunsky
45a43f7e5e get_marginals(): fixup point_right fallback 2026-04-30 16:08:15 +02:00
kba
0b8d8a7330 docker: core to 3.12.3 2026-04-29 17:20:36 +02:00
kba
ad5f22726e 🔥 require transformers >= 5 2026-04-29 17:06:13 +02:00
kba
f58189d5f4 ci: tag eynollah docker image with git tag version if possible, else latest 2026-04-29 16:34:06 +02:00
Robert Sachunsky
68ceeec764 get_marginals(): improve contour assignment…
- use undeskewed mask for contour comparisons
  instead of deskewed mask (less precise)
- rename `text_with_lines` → `text_mask_d`
- rename `mask_marginals` → `main_mask_d`
- rename `text_regions` → `early_layout`
- rename `...textline...` → `...text...`
2026-04-25 03:06:34 +02:00
Robert Sachunsky
6d55d0b87b get_marginals(): improve peak point threshold criterion…
in search of valid peaks (gaps between text columns),
- drop absolute values for minimum gap depth
  (likely crafted for some fixed resolution examples)
- instead, use criterion relative to maximum column depth
  and page height (trying to loosely approximate the prior
  constants, albeit somewhat more permissive)
2026-04-25 02:23:16 +02:00
Robert Sachunsky
4bdea39c98 get_marginals(): improve left/right point selection…
in search of valid (above threshold) peaks:
- do not just pick right-most left and left-most right span;
- instead,
  * if no peaks on the left, then only search right
  * if no peaks on the right, then only search left
  * if peaks on both sides, then only better side
    (so never return marginals on both sides!)
  * use scoring for peaks that reflects their peak
    prominence and peak height (but keep positional
    range constraints for what constitutes left and right)
2026-04-25 01:59:48 +02:00
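
The selection rules listed in 4bdea39c98 can be sketched as follows. This is only an illustration of the decision logic at that commit (a031d590b8 later re-allows both sides): the function name, the `(position, height, prominence)` tuple layout and the product used as score are assumptions, not the scoring in `get_marginals()`.

```python
def pick_marginal_side(left_peaks, right_peaks):
    """Choose at most one side; each peak is (position, height, prominence).

    Returns ("left" | "right", peak), or None if there are no peaks at all.
    """
    def score(peak):
        _, height, prominence = peak
        return height * prominence  # assumed scoring, for illustration only

    best_left = max(left_peaks, key=score, default=None)
    best_right = max(right_peaks, key=score, default=None)
    if best_left is None and best_right is None:
        return None
    if best_right is None:   # no peaks on the right: only search left
        return ("left", best_left)
    if best_left is None:    # no peaks on the left: only search right
        return ("right", best_right)
    # Peaks on both sides: keep only the better-scoring side.
    if score(best_left) >= score(best_right):
        return ("left", best_left)
    return ("right", best_right)
```

The caller is assumed to have already split the above-threshold peaks into left and right candidates using the positional range constraints.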
Robert Sachunsky
70bf461c30 get_marginals(): simplify, improve…
- rename `thickness_along_y_percent` →
  `max_textline_thickness_percent`
- rename `marginlas_should_be_main_text` →
  `main_text_should_be_marginals`
- constrain `find_peaks()` by prominence and distance
- simplify (a lot)
- add comments for possible improvements
  and for plotting
2026-04-25 01:52:21 +02:00