Commit graph

489 commits

Author SHA1 Message Date
Robert Sachunsky
5c12b6a851 combine_hor_lines_and_delete_cross_points: simplify and rename
- `x_width_smaller_than_acolumn_width` →
  `avg_col_width`
- `len_lines_bigger_than_x_width_smaller_than_acolumn_width` →
  `nseps_wider_than_than_avg_col_width`
- `img_in_hor` → `img_p_in_hor` (analogous to vertical)
2025-11-28 17:27:12 +01:00
Robert Sachunsky
06cb9d1d31 combine_hor_lines_and_delete_cross_points: fix 1-off px bug
when eroding the vertical separator mask (by slicing),
avoid leaving 1px strips
2025-11-28 17:08:39 +01:00
Robert Sachunsky
38d91673b1 combine_hor_lines_and_delete_cross_points: get external contours
instead of tree without looking at the actual hierarchy

(to prevent retrieving holes as separators)
2025-11-28 16:50:08 +01:00
Robert Sachunsky
ee59a6809d contours_in_same_horizon: fix 5d15941b 2025-11-28 16:17:09 +01:00
kba
b161e33854 🔥 refactor eynollah ocr
.
2025-11-28 15:45:21 +01:00
kba
30f9c695dc move line-gt extraction out of ocr to eynollah-training 2025-11-28 15:12:31 +01:00
kba
9bcfeab057 💀 remove dead code from eynollah.py 2025-11-28 12:52:28 +01:00
kba
5171e09c2d eynollah.py: fix kwargs to writer 2025-11-28 12:52:28 +01:00
kba
c24cf94bce enforce kwargs for writer.build_... 2025-11-28 12:52:28 +01:00
kba
4aa9543a7d remove more branches after textline_light default true 2025-11-27 11:30:00 +01:00
kba
177d555ded factor out extract_only_images as eynollah extract-images 2025-11-26 21:37:00 +01:00
kba
83e8b289da 🔥 drop light_version/textline_light (now default and implied) 2025-11-26 20:48:22 +01:00
kba
ca83cf934d fix imports from src/cli/cli_*/*_cli 2025-11-26 20:48:14 +01:00
kba
095b36c389 models: split into layout, extra and ocr
layout: Everything not OCR or extra
ocr: trocr/cnnrnn models
extra: obsolete or niche models
2025-11-26 19:49:59 +01:00
kba
000af16a47 🔥 remove torch pinning 2025-11-26 19:23:49 +01:00
kba
e503c1a0b7 drop obsolete multi-model binarization 2025-11-26 18:51:41 +01:00
kba
82266f8234 reorganize cli 2025-11-26 18:51:20 +01:00
kba
5a1900e664 🔥 remove OCR option from eynollah layout 2025-11-26 18:12:03 +01:00
kba
0f410c2e7c disable tf/keras logging on first import 2025-11-26 16:37:54 +01:00
kba
9d9d32daed update OCR-D bindings 2025-11-26 16:20:27 +01:00
Robert Sachunsky
e428e7ad78 ensure separators stay within image bounds 2025-11-16 16:35:18 +01:00
Robert Sachunsky
406288b1fe fixup 72d059f3: forgot to update other writer calls 2025-11-16 16:32:45 +01:00
Robert Sachunsky
028ed16921 adapt ocrd-sbb-binarize 2025-11-15 17:17:37 +01:00
Robert Sachunsky
49ab269e08 fix typos found by ruff 2025-11-15 15:49:51 +01:00
Robert Sachunsky
72d059f3c9 reading order: simplify assignment / counting
- `do_order_of_regions`: simplify aggregating per-box orders
  for paragraphs and headings to overall order passed to
  `xml_reading_order`; no need for `order_and_id_of_texts`,
  no need to return `id_of_texts_tot`
- `do_order_of_regions_with_model`: no need to return `region_ids`
- writer: no need to pass `id_of_texts_tot` in `build_pagexml`
2025-11-15 14:34:12 +01:00
Robert Sachunsky
5a778003fd contour matching for deskewed image: ensure matches for both sides 2025-11-15 14:32:22 +01:00
Robert Sachunsky
3c15c4f7d4 back to rotate_image instead of rotation_image_new for deskewing
(because the latter does not preserve coordinates;
 it scales, even when resizing the image;
 this caused coordinate problems when matching deskewed contours)
2025-11-15 14:29:41 +01:00
Robert Sachunsky
4475183f08 improve rules governing column split
- reduce `sigma` for smoothing of input to `find_peaks`
  (so we get deeper gaps between columns)
- allow column boundaries closer to the margins
  (50 instead of 100 or 200 px, 170 instead of 370 px)
- allow column boundaries closer to each other
  (300 instead of 400 px)
- add a secondary `grenze` criterion for depth of gap
  (relative to lowest minimum, if that is smaller than
   the old criterion relative to lowest maximum)
- for calls to `find_num_col` within parts of a page,
  do allow unbalanced column boundaries
2025-11-14 13:15:09 +01:00
Robert Sachunsky
4abc2ff572 rewrite/simplify manual reading order using recursive algorithm
- rename `return_x_start_end_mothers_childs_and_type_of_reading_order`
  → `return_multicol_separators_x_start_end`, and drop all the analysis
  pertaining to mother/child relationships and full-span separators,
  also drop the separator unification rules;
  instead of the latter, try to combine neighbouring separators more
  generally: join column spans iff there is nothing in between
  (which also necessitates passing the region mask), and keep only
  one of every such redundant pair;
  add the top (of each page part) as full-span separator up front,
  and return separators already ordered by y
- `return_boxes_of_images_by_order_of_reading_new`:
  - also pass regions with separators, so they do not have to be
    reconstructed from the separator coordinates, and also contain
    images and other non-text region types, when trying to elongate
    separators to maximize their span (without introducing overlaps)
  - determine connected components of the region mask, i.e. labels
    and their respective bboxes, in order to
    1. gain additional multi-column separators, if possible
    2. avoid cutting through regions which do cross column boundaries
       later on
  - whenever adding a new bbox, first look up the label map to see if
    there are any multi-column regions extending to the right of the
    current column; if there are, then advance not just one column
    to the right, but as many as necessary to avoid cutting through
    these regions
  - new core algorithm: iterate separators sorted by y and then column
    by column, but whenever the next separator ends in the same column
    as the current one or even further left, recurse (i.e. finish that
    span first before continuing with the top iteration)
2025-11-14 13:14:53 +01:00
Robert Sachunsky
95f76081d1 rename some more identifiers:
- `lines` → `seps` (to distinguish from textlines)
- `text_regions_p_1_n` → `text_regions_p_d` (because all other
  deskewed variables are called like this)
- `pixel` → `label`
2025-11-14 13:13:50 +01:00
Robert Sachunsky
1a76ce177d do_order_of_regions: round contour centers
(so we can be sure they do not fall through the
 "pixel cracks": bboxes are delimited by integers,
 and we do not want to assign contours between
 boxes)
2025-11-14 13:08:10 +01:00
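A minimal sketch of the "pixel crack" problem this commit describes, assuming boxes are tested with inclusive integer bounds on both sides. The helper `find_box` and the sample coordinates are hypothetical and not taken from the repository.

```python
def find_box(center_x, boxes):
    """Return the index of the first box whose inclusive integer
    x-range contains center_x, or None if no box matches."""
    for i, (x_min, x_max) in enumerate(boxes):
        if x_min <= center_x <= x_max:
            return i
    return None

# two adjacent boxes delimited by integers (hypothetical values)
boxes = [(0, 99), (100, 199)]
raw_center = 99.6  # a contour center falling between the two boxes

unassigned = find_box(raw_center, boxes)       # no box contains 99.6
assigned = find_box(round(raw_center), boxes)  # 100 lies in the second box
```

Under this assumption the unrounded center matches no box, while the rounded one is assigned to the second box.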
vahidrezanezhad
ed5b5c13dd Add test images; call TrOCR processor from the same directory as the TrOCR model 2025-11-07 12:47:21 +01:00
kba
f902756ce1 try importing torch, then shapely, then tensorflow 2025-11-06 13:10:35 +01:00
kba
d224b0f7e8 try with shapely.set_precision(...mode="keep_collapsed") 2025-11-06 11:55:40 +01:00
kba
9ab565fa02 model basedir might be a symlink 2025-10-29 21:02:42 +01:00
kba
4772fd17e2 missed changing override mechanism in eynollah_ocr 2025-10-29 20:47:13 +01:00
kba
29c273685f fix merge issues 2025-10-29 20:15:19 +01:00
kba
de76eabc1d Merge branch 'cli-logging' into model-zoo 2025-10-29 19:41:01 +01:00
kba
5e22e9db64 model_zoo: make type str to reduce importing overhead 2025-10-29 19:16:35 +01:00
kba
a913bdf7dc make --model-basedir and --model-overrides top-level CLI options 2025-10-29 18:48:41 +01:00
kba
b6f82c72b9 refactor cli tests 2025-10-29 17:23:21 +01:00
kba
ef999c8f0a Merge branch 'model-zoo' of lx0145.sbb.spk-berlin.de:/data/eynollah into model-zoo 2025-10-27 11:45:20 +01:00
kba
294b6356d3 wip 2025-10-27 11:45:16 +01:00
kba
51d2680d9c wip 2025-10-27 11:44:59 +01:00
Robert Sachunsky
19b2c3fa42 reading order: improve handling of headings and horizontal seps
- drop connected components analysis to test overlaps between
  horizontal separators and (horizontal) neighbours (introduced
  in ab17a927)
- instead of converting headings to topline and baseline during
  `find_number_of_columns_in_document` (introduced in 9f1595d7),
  add them to the matrix unchanged, but mark as extra type
  (besides horizontal and vertical separators)
- convert headings to toplines and baselines no earlier than in
  `return_boxes_of_images_by_order_of_reading_new`
- for both headings and horizontal separators, if they already
  span multiple columns, check if they would overlap (horizontal)
  neighbours by looking at successively larger (left and right)
  intervals of columns (and pick the largest elongation which
  does not introduce any overlaps)
2025-10-25 13:36:35 +02:00
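The elongation rule in the last item can be sketched roughly as follows: widen a column span one column at a time on each side, and stop as soon as the next column would introduce an overlap. This is a simplified illustration, not the repository's implementation; the function `elongate_span` and the `blocked` list are hypothetical.

```python
def elongate_span(start, end, blocked):
    """Widen the half-open column span [start, end) in both directions.

    blocked[c] is True if column c already holds a horizontal neighbour
    at the height of the separator or heading (hypothetical input).
    """
    # grow to the left while the next column on that side is free
    while start > 0 and not blocked[start - 1]:
        start -= 1
    # grow to the right while the next column on that side is free
    while end < len(blocked) and not blocked[end]:
        end += 1
    return start, end

# six columns; a separator spanning columns 2..3, neighbours in columns 0 and 5
blocked = [True, False, False, False, False, True]
widened = elongate_span(2, 4, blocked)  # (1, 5)
```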
Robert Sachunsky
3367462d18 return_boxes_of_images_by_order_of_reading_new: change arg order 2025-10-25 13:36:24 +02:00
Robert Sachunsky
a2a9fe5117 delete_separator_around: simplify, eynollah: identifiers
- use array instead of list operations
- rename identifiers:
  - `pixel` → `label`
  - `line` → `sep`
2025-10-25 13:36:17 +02:00
Robert Sachunsky
3ebbc2d693 return_boxes_of_images_by_order_of_reading_new: indent
(by removing unnecessary conditional)
2025-10-25 13:36:06 +02:00
Robert Sachunsky
66a0e55e49 return_boxes_of_images_by_order_of_reading_new: avoid oversplits
when y slice (`top:bot`) is not a significant part of the page,
viz. less than 22% (as in `find_number_of_columns_in_document`),
avoid forcing `find_num_col` to reach `num_col_classifier`

(allows large headers not to be split up and thus better ordered)
2025-10-25 13:35:56 +02:00
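The size check described above could look roughly like this. Only the 22% figure comes from the commit message; the function name and the way the result would be used are assumptions.

```python
MIN_PAGE_FRACTION = 0.22  # threshold quoted in the commit message

def may_force_num_col(top, bot, page_height, min_fraction=MIN_PAGE_FRACTION):
    """Return True if the y slice top:bot covers a significant part of
    the page, so that the column count may be forced to the classifier
    result; smaller slices (e.g. large headers) are left alone."""
    return (bot - top) >= min_fraction * page_height

large_part = may_force_num_col(0, 500, 1000)    # half of the page
small_part = may_force_num_col(100, 300, 1000)  # 20% of the page
```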
Robert Sachunsky
6fbb5f8a12 return_boxes_of_images_by_order_of_reading_new: simplify
- array instead of list operations
- add better plotting (but commented out)
- add more debug printing (but commented out)
- add more inline comments for documentation
- rename identifiers to make more readable:
  - `cy_hor_diff` → `y_max_hor_some` (because the ymax gets passed)
  - `lines` → `seps`
  - `y_type_2` → `y_mid`
  - `y_diff_type_2` → `y_max`
  - `y_lines_by_order` → `y_mid_by_order`
  - `y_lines_without_mother` → `y_mid_without_mother`
  - `y_lines_with_child_without_mother` → `y_mid_with_child_without_mother`
  - `y_column` → `y_mid_column`
  - `y_column_nc` → `y_mid_column_nc`
  - `y_all_between_nm_wc` → `y_mid_between_nm_wc`
  - `lines_so_close_to_top_separator` → `seps_too_close_to_top_separator`
  - `y_in_cols` and `y_down` → `y_mid_next`
- use `pairwise()` `nc_top:nc_bot` instead of `i_c` indexing
2025-10-25 13:35:44 +02:00
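As an illustration of the last item, `itertools.pairwise` yields consecutive pairs of boundaries directly, which avoids indexing with `i_c` and `i_c + 1`. The boundary values below are hypothetical; the names `nc_top` and `nc_bot` are borrowed from the commit message.

```python
try:
    from itertools import pairwise  # available from Python 3.10
except ImportError:
    from itertools import tee

    def pairwise(iterable):
        """Fallback for older Python versions."""
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

# hypothetical sorted boundaries
bounds = [0, 400, 800, 1200]

# each consecutive pair delimits one slice nc_top:nc_bot
spans = [(nc_top, nc_bot) for nc_top, nc_bot in pairwise(bounds)]
```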