1359 commits

Author SHA1 Message Date
vahidrezanezhad
f9695cd7be Merge branch 'adding-cnn-rnn-training-script' of https://github.com/qurator-spk/eynollah into adding-cnn-rnn-training-script 2026-01-28 11:52:36 +01:00
vahidrezanezhad
3500167870 weights ensembling for tensorflow models is integrated 2026-01-28 11:52:12 +01:00
vahidrezanezhad
33f6a231bc fix: prevent crash when printspace is missing in xmls used for label generation 2026-01-26 17:30:26 +01:00
vahidrezanezhad
6ae244bf9b Fix filename stem extraction using binarization. Restore the CNN-RNN model to its previous version, as setting channels_last alone was insufficient for running on both CPU and GPU. Prevent errors caused by null values in image shape elements. 2026-01-26 15:04:47 +01:00
vahidrezanezhad
30f39e7383 mapregion is added to labels 2026-01-26 13:56:34 +01:00
vahidrezanezhad
c8240905a8 Fix label generation by selecting largest contour when erosion splits shapes 2026-01-26 13:36:24 +01:00
Robert Sachunsky
3c3effcfda drop TF1 vernacular, relax TF/Keras and Torch requirements…
- do not restrict TF version, but depend on tf-keras and
  set `TF_USE_LEGACY_KERAS=1` to avoid Keras 3 behaviour
- relax Numpy version requirement up to v2
- relax Torch version requirement
- drop TF1 session management code
- drop TF1 config in favour of TF2 config code for memory growth
- training.*: also simplify and limit line length
- training.train: always train with TensorBoard callback
2026-01-20 11:34:02 +01:00
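Note on the entry above: a minimal sketch of the two TF2-side settings it mentions, namely setting `TF_USE_LEGACY_KERAS=1` before TensorFlow is first imported and enabling GPU memory growth through the TF2 config API. The helper name `enable_memory_growth` is hypothetical and not taken from the repository; the import is guarded so the snippet also runs where TensorFlow is absent.

```python
import os

# Must be set before TensorFlow is imported for the first time,
# so that tf.keras resolves to the legacy (Keras 2) tf-keras package.
os.environ.setdefault("TF_USE_LEGACY_KERAS", "1")


def enable_memory_growth() -> int:
    """Enable GPU memory growth; return the number of GPUs configured.

    Hypothetical helper for illustration only.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return 0  # TensorFlow not installed; nothing to configure
    configured = 0
    for gpu in tf.config.list_physical_devices("GPU"):
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
            configured += 1
        except RuntimeError:
            # Raised when the device has already been initialised.
            pass
    return configured
```

Calling `enable_memory_growth()` once at start-up, before any model is built, would be the usual placement.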
Robert Sachunsky
e2754da4f5 adapt to Numpy 1.25 changes…
(esp. `np.array(...)` now not allowed on ragged arrays unless
 `dtype=object`, but then coercing sub-arrays to `object` as well)
2026-01-20 04:04:07 +01:00
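Note on the entry above: one common way to build an array of differently shaped sub-arrays without relying on `np.array(...)` is to pre-allocate an object array and fill it element by element. This is a sketch of that general workaround, not necessarily what the commit does; the variable names are illustrative.

```python
import numpy as np

# Two "contours" with different numbers of points (a ragged collection).
contours = [np.zeros((3, 2), dtype=int), np.zeros((5, 2), dtype=int)]

# Pre-allocate an object array and assign each item individually, so NumPy
# never tries to stack the sub-arrays into one rectangular block and the
# sub-arrays keep their own integer dtype.
ragged = np.empty(len(contours), dtype=object)
for i, contour in enumerate(contours):
    ragged[i] = contour
```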
kba
9ccc495b4a wip 2025-12-19 14:57:10 +01:00
vahidrezanezhad
49261fa99b CNN–RNN–OCR inference and adaptation of the CNN–RNN–OCR model to support inference on both CPU and GPU 2025-12-17 15:12:39 +01:00
vahidrezanezhad
6ee79c7320 evaluation with a given GT is only possible for segmentation tasks 2025-12-17 13:28:02 +01:00
vahidrezanezhad
4651000191 debugging input shape + enable finetuning a model 2025-12-15 11:36:09 +01:00
vahidrezanezhad
4fc3ff33cb The cnn-rnn ocr model can be trained now 2025-12-09 17:22:12 +01:00
vahidrezanezhad
84a72a128b cnn-rnn model can be called - model input height and width are dynamic now - data generator is also callable 2025-12-09 15:30:19 +01:00
vahidrezanezhad
59e5a73654 adding cnn-rnn training script 2025-12-08 19:30:57 +01:00
vahidrezanezhad
7bf5e077d9 Restore correct execution of export_textline_images_and_text 2025-12-03 15:40:52 +01:00
vahidrezanezhad
6ac37af2f8 Fix eynollah ocr --help so it works again 2025-12-03 14:11:47 +01:00
vahidrezanezhad
d687d862d6 Restored correct functionality of the extract_only_images mode and cleaned up the argument handling 2025-12-03 12:01:42 +01:00
Robert Sachunsky
9fdae72e96 utils_ocr.return_textline_contour: gen cv2-like contours (w/ ndim=3, as in all other places) 2025-12-03 03:04:46 +01:00
Robert Sachunsky
ad8f8167c2 separate_lines/_vertical: gen cv2-like contours (w/ ndim=3, as in all other places) 2025-12-03 00:58:26 +01:00
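Note on the two entries above: OpenCV represents a contour as an integer array of shape `(N, 1, 2)`, i.e. `ndim=3`. A minimal sketch of converting a plain list of `(x, y)` points into that layout follows; the helper name `as_cv2_contour` is hypothetical.

```python
import numpy as np


def as_cv2_contour(points):
    """Convert a sequence of (x, y) pairs to a cv2-style contour array."""
    return np.asarray(points, dtype=np.int32).reshape(-1, 1, 2)


contour = as_cv2_contour([(0, 0), (10, 0), (10, 5), (0, 5)])
```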
Robert Sachunsky
43a95842bd writer: also ensure validity after scaling 2025-12-02 16:35:32 +01:00
kba
51abe9617a log to STDERR not STDOUT 2025-12-02 15:00:33 +01:00
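Note on the entry above: with Python's standard `logging` module, sending log output to STDERR amounts to attaching a `StreamHandler` bound to `sys.stderr`, which keeps STDOUT free for actual results. The logger name and format string below are assumptions for illustration.

```python
import logging
import sys


def make_logger(name="example"):
    """Return a logger whose records go to stderr (hypothetical helper)."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s - %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


logger = make_logger()
logger.info("diagnostics go to stderr; stdout stays free for results")
```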
Robert Sachunsky
56e73bf72f deskewing: add a 2nd stage for precision
after selecting the optimum angle on the original
search range, narrow down in its vicinity
with half the range (adding computational cost,
but gaining precision)
2025-11-28 18:27:58 +01:00
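Note on the entry above: the two-stage idea can be sketched as a coarse search followed by a second search centred on the coarse optimum over half the original range. The scoring function, the default range and the number of candidates are assumptions; the real deskewing scores rotated page images rather than a toy function.

```python
def best_angle(score, center, half_range, steps):
    """Return the candidate angle with the highest score in the window."""
    candidates = [
        center - half_range + 2 * half_range * i / (steps - 1)
        for i in range(steps)
    ]
    return max(candidates, key=score)


def two_stage_angle(score, half_range=10.0, steps=21):
    """Coarse search, then a second search with half the range."""
    coarse = best_angle(score, 0.0, half_range, steps)
    return best_angle(score, coarse, half_range / 2, steps)


def toy_score(angle):
    # Stand-in for a real skew score; peaks at 1.3 degrees.
    return -(angle - 1.3) ** 2


coarse = best_angle(toy_score, 0.0, 10.0, 21)
fine = two_stage_angle(toy_score)
```

The second pass doubles the number of score evaluations, which matches the trade-off named in the commit message.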
Robert Sachunsky
adcea47bc0 return_boxes_of_images_by_order_of_reading_new: always erode
when passing the text region mask, do not apply erosion only
if there are more than 2 columns, but iff `not erosion_hurts`
(consistent with `find_num_col`'s expectations and making
 it as easy to find the column gaps on 1 and 2-column pages
 as on multi-column pages)
2025-11-28 18:23:59 +01:00
Robert Sachunsky
5a3de3b42d column detection: improve, aided by vseps whenever possible
- `find_number_of_columns_in_document`: retain vertical separators
  and pass to `find_num_col` for each vertical split
- `return_boxes_of_images_by_order_of_reading_new`: reconstruct
  the vertical separators from the segmentation mask and the separator
  bboxes; pass it on to `find_num_col` everywhere
- `return_boxes_of_images_by_order_of_reading_new`: no need to
  try-catch `find_num_col` anymore
- `return_boxes_of_images_by_order_of_reading_new`: when a vertical
  split has too few columns,
  * do not raise but lower the threshold `multiplier` responsible for
    allowing gaps as column boundaries
  * do not pass the `num_col_classifier` (i.e. expected number of
    resulting columns) of the entire page to the iterative
    `find_num_col` for each existing column, but only the portion
    of that span
2025-11-28 18:14:24 +01:00
Robert Sachunsky
4dd40c542b find_num_col: add optional criterion - sum of vertical separators
when searching for gaps between text regions, consider the vertical
separator mask (if given): add the vertical sum of vertical separators
to the peak scores (making column detection more robust if still slightly
skewed or partially obscured by multi-column regions, but fg seps are
present)
2025-11-28 18:07:15 +01:00
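Note on the entry above: a toy illustration of the criterion, assuming gap candidates are scored from the column-wise text projection and that the column-wise sum of the vertical separator mask is simply added to that score. The function `gap_scores` and the synthetic masks are hypothetical; the actual `find_num_col` is more involved.

```python
import numpy as np


def gap_scores(text_mask, vsep_mask=None):
    """Score each pixel column as a potential column gap (higher is better)."""
    text_profile = text_mask.sum(axis=0).astype(float)
    # Columns with little text are gap candidates: invert the profile.
    scores = text_profile.max() - text_profile
    if vsep_mask is not None:
        # Extra criterion: add the vertical sum of vertical separators.
        scores = scores + vsep_mask.sum(axis=0)
    return scores


text = np.zeros((100, 60), dtype=np.uint8)
text[:, 5:25] = 1         # left column
text[:, 35:55] = 1        # right column
text[40:60, 25:35] = 1    # region spanning both columns, obscuring the gap
vseps = np.zeros_like(text)
vseps[:40, 30] = 1        # vertical separator above the spanning region

without = gap_scores(text)
with_seps = gap_scores(text, vseps)
```

In this toy page the separator lifts the score at x=30 above that of the empty page margins, so the true column gap becomes the strongest candidate.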
Robert Sachunsky
84d10962f3 return_boxes_of_images_by_order_of_reading_new: improve
- when searching for multi-col box makers, pick the right-most
  allowable column, not the left-most
2025-11-28 18:04:12 +01:00
Robert Sachunsky
5abf0c1097 return_boxes_of_images_by_order_of_reading_new: improve
- when analysing regions spanning across columns,
  disregard tiny regions (smaller than half the median size)
- if a region spans across columns just by a tiny fraction,
  and therefore is not good enough for a multi-col separator,
  then it should also not be good enough for a multi-col box
  maker
2025-11-28 17:58:44 +01:00
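Note on the entry above: the "smaller than half the median size" rule can be sketched as a simple filter. Which measure counts as "size" (area, width or height) is not stated in the commit message, so the numbers below are purely illustrative and `keep_non_tiny` is a hypothetical name.

```python
from statistics import median


def keep_non_tiny(sizes):
    """Return a keep/drop flag per region: drop those below half the median."""
    threshold = 0.5 * median(sizes)
    return [size >= threshold for size in sizes]


flags = keep_non_tiny([100, 120, 90, 30, 110])
```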
Robert Sachunsky
b71bb80e3a return_boxes_of_images_by_order_of_reading_new: fix 4abc2ff5
(forgot to also flip `regions_with_separators` if right2left)
2025-11-28 17:57:10 +01:00
Robert Sachunsky
a527d7a10d combine_hor_lines_and_delete_cross_points: improve
- avoid unnecessary `fillPoly` (we already have the mask)
- do not merge hseps if vseps interfere
- remove old criterion (based on total length of hseps)
- create new criterion (no x overlap and x close to each other)
- rename identifiers:
  * `sum_dis` → `sum_xspan`
  * `diff_max_min_uniques` → `tot_xspan`
  * np.std / np.mean → `dev_xspan`
- remove rule cutting around the center of crossing seps
  (which is unnecessary and creates small isolated seps
  at the center, unrelated to the actual crossing points)
- create rule cutting hseps by vseps _prior_ to merging
2025-11-28 17:34:11 +01:00
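Note on the entry above: the new merge criterion ("no x overlap and x close to each other") can be sketched on plain x-intervals. The threshold `max_gap` and the function name are assumptions; the commit message does not state how closeness is measured, and the check for interfering vertical separators is left out here.

```python
def can_merge(a, b, max_gap):
    """Decide whether two horizontal separators, given as (x_min, x_max),
    are merge candidates: disjoint in x, yet no farther apart than max_gap."""
    (_, left_max), (right_min, _) = sorted([a, b])
    gap = right_min - left_max
    return 0 <= gap <= max_gap


close = can_merge((0, 10), (12, 20), max_gap=5)
overlapping = can_merge((0, 10), (8, 20), max_gap=5)
distant = can_merge((0, 10), (30, 40), max_gap=5)
```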
Robert Sachunsky
5c12b6a851 combine_hor_lines_and_delete_cross_points: simplify and rename
- `x_width_smaller_than_acolumn_width` →
  `avg_col_width`
- `len_lines_bigger_than_x_width_smaller_than_acolumn_width` →
  `nseps_wider_than_than_avg_col_width`
- `img_in_hor` → `img_p_in_hor` (analogous to vertical)
2025-11-28 17:27:12 +01:00
Robert Sachunsky
06cb9d1d31 combine_hor_lines_and_delete_cross_points: fix 1-off px bug
when eroding the vertical separator mask (by slicing),
avoid leaving 1px strips
2025-11-28 17:08:39 +01:00
Robert Sachunsky
38d91673b1 combine_hor_lines_and_delete_cross_points: get external contours
instead of tree without looking at the actual hierarchy

(to prevent retrieving holes as separators)
2025-11-28 16:50:08 +01:00
Robert Sachunsky
ee59a6809d contours_in_same_horizon: fix 5d15941b 2025-11-28 16:17:09 +01:00
kba
b161e33854 🔥 refactor eynollah ocr
.
2025-11-28 15:45:21 +01:00
kba
30f9c695dc move line-gt extraction out of ocr to eynollah-training 2025-11-28 15:12:31 +01:00
kba
951bd2fce6 CI: do not upgrade (now-unpinned) torch 2025-11-28 15:12:31 +01:00
kba
9bcfeab057 💀 remove dead code from eynollah.py 2025-11-28 12:52:28 +01:00
kba
5171e09c2d eynollah.py: fix kwargs to writer 2025-11-28 12:52:28 +01:00
kba
c24cf94bce enforce kwargs for writer.build_... 2025-11-28 12:52:28 +01:00
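Note on the entry above: in Python, keyword-only arguments are enforced by placing a bare `*` in the signature. The function `build_page` and its parameters are hypothetical stand-ins, since the actual `writer.build_...` signatures are not shown here.

```python
def build_page(*, regions, textlines, skew=0.0):
    """Hypothetical builder: every argument must be passed by keyword."""
    return {"regions": regions, "textlines": textlines, "skew": skew}


page = build_page(regions=["r1"], textlines=["l1", "l2"])

error = None
try:
    # Positional use is rejected, so arguments cannot be swapped silently.
    build_page(["r1"], ["l1", "l2"])
except TypeError as exc:
    error = str(exc)
```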
kba
4aa9543a7d remove more branches after textline_light default true 2025-11-27 11:30:00 +01:00
kba
177d555ded factor out extract_only_images as eynollah extract-images 2025-11-26 21:37:00 +01:00
kba
83e8b289da 🔥 drop light_version/textline_light (now default and implied) 2025-11-26 20:48:22 +01:00
kba
ca83cf934d fix imports from src/cli/cli_*/*_cli 2025-11-26 20:48:14 +01:00
kba
095b36c389 models: split into layout, extra and ocr
layout: Everything not OCR or extra
ocr: trocr/cnnrnn models
extra: obsolete or niche models
2025-11-26 19:49:59 +01:00
kba
000af16a47 🔥 remove torch pinning 2025-11-26 19:23:49 +01:00
kba
e503c1a0b7 drop obsolete multi-model binarization 2025-11-26 18:51:41 +01:00
kba
82266f8234 reorganize cli 2025-11-26 18:51:20 +01:00
kba
5a1900e664 🔥 remove OCR option from eynollah layout 2025-11-26 18:12:03 +01:00
kba
0f410c2e7c disable tf/keras logging on first import 2025-11-26 16:37:54 +01:00