Commit graph

560 commits

Author SHA1 Message Date
Robert Sachunsky
95f76081d1 rename some more identifiers:
- `lines` → `seps` (to distinguish from textlines)
- `text_regions_p_1_n` → `text_regions_p_d` (because all other
  deskewed variables are called like this)
- `pixel` → `label`
2025-11-14 13:13:50 +01:00
Robert Sachunsky
1a76ce177d do_order_of_regions: round contour centers
(so we can be sure they do not fall through the
 "pixel cracks": bboxes are delimited by integers,
 and we do not want to assign contours between
 boxes)
2025-11-14 13:08:10 +01:00
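Note on 1a76ce177d: a minimal sketch of why rounding matters, not the actual `do_order_of_regions` code. The helper name `assign_to_boxes` and the inclusive `(x_min, x_max, y_min, y_max)` box layout are assumptions made for illustration.

```python
def assign_to_boxes(centers, boxes):
    """Return, for each contour center, the index of the first box containing it (or -1).

    centers: iterable of (x, y) floats
    boxes: iterable of (x_min, x_max, y_min, y_max) ints, both ends inclusive
    (hypothetical layout, chosen for this sketch only)
    """
    result = []
    for cx, cy in centers:
        # round first, so that a center such as x=99.4 lands on an integer pixel
        # instead of falling into the gap between x_max=99 and x_min=100
        x, y = round(cx), round(cy)
        index = -1
        for i, (x_min, x_max, y_min, y_max) in enumerate(boxes):
            if x_min <= x <= x_max and y_min <= y <= y_max:
                index = i
                break
        result.append(index)
    return result


boxes = [(0, 99, 0, 50), (100, 199, 0, 50)]
centers = [(99.4, 10.2), (99.6, 10.0), (300.0, 10.0)]
print(assign_to_boxes(centers, boxes))
```

Without the rounding step, the first two centers would match neither box, because both boxes are delimited by integers.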
vahidrezanezhad
ed5b5c13dd Add test images; call TrOCR processor from the same directory as the TrOCR model 2025-11-07 12:47:21 +01:00
kba
f902756ce1 try importing torch, then shapely, then tensorflow 2025-11-06 13:10:35 +01:00
kba
d224b0f7e8 try with shapely.set_precision(...mode="keep_collapsed") 2025-11-06 11:55:40 +01:00
kba
9ab565fa02 model basedir might be a symlink 2025-10-29 21:02:42 +01:00
kba
4772fd17e2 missed changing override mechanism in eynollah_ocr 2025-10-29 20:47:13 +01:00
kba
29c273685f fix merge issues 2025-10-29 20:15:19 +01:00
kba
de76eabc1d Merge branch 'cli-logging' into model-zoo 2025-10-29 19:41:01 +01:00
kba
5e22e9db64 model_zoo: make type str to reduce importing overhead 2025-10-29 19:16:35 +01:00
kba
a913bdf7dc make --model-basedir and --model-overrides top-level CLI options 2025-10-29 18:48:41 +01:00
kba
b6f82c72b9 refactor cli tests 2025-10-29 17:23:21 +01:00
kba
ef999c8f0a Merge branch 'model-zoo' of lx0145.sbb.spk-berlin.de:/data/eynollah into model-zoo 2025-10-27 11:45:20 +01:00
kba
294b6356d3 wip 2025-10-27 11:45:16 +01:00
kba
51d2680d9c wip 2025-10-27 11:44:59 +01:00
Robert Sachunsky
19b2c3fa42 reading order: improve handling of headings and horizontal seps
- drop connected components analysis to test overlaps between
  horizontal separators and (horizontal) neighbours (introduced
  in ab17a927)
- instead of converting headings to topline and baseline during
  `find_number_of_columns_in_document` (introduced in 9f1595d7),
  add them to the matrix unchanged, but mark as extra type
  (besides horizontal and vertical separators)
- convert headings to toplines and baselines no earlier than in
  `return_boxes_of_images_by_order_of_reading_new`
- for both headings and horizontal separators, if they already
  span multiple columns, check if they would overlap (horizontal)
  neighbours by looking at successively larger (left and right)
  intervals of columns (and pick the largest elongation which
  does not introduce any overlaps)
2025-10-25 13:36:35 +02:00
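Note on 19b2c3fa42: the last bullet (growing a heading or horizontal separator over successively larger column intervals) could look roughly like the sketch below. It is not the code from the commit. The function name `elongate`, the half-open interval convention and the precomputed `blocked` list (one flag per column, true where a horizontal neighbour would be overlapped) are all assumptions.

```python
def elongate(blocked, start, end):
    """Grow the half-open column interval [start, end) as far as possible.

    blocked[c] is True if extending into column c would overlap a
    (horizontal) neighbour. How this flag is computed is left out here.
    """
    left = start
    # try successively larger intervals to the left
    while left > 0 and not blocked[left - 1]:
        left -= 1
    right = end
    # try successively larger intervals to the right
    while right < len(blocked) and not blocked[right]:
        right += 1
    return left, right


# separator currently spans columns 2..3; columns 0 and 5 are occupied
print(elongate([True, False, False, False, False, True], 2, 4))
```

The result is the largest interval around the original span that introduces no overlap, which is then used as the elongated extent.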
Robert Sachunsky
3367462d18 return_boxes_of_images_by_order_of_reading_new: change arg order 2025-10-25 13:36:24 +02:00
Robert Sachunsky
a2a9fe5117 delete_separator_around: simplify, eynollah: identifiers
- use array instead of list operations
- rename identifiers:
  - `pixel` → `label`
  - `line` → `sep`
2025-10-25 13:36:17 +02:00
Robert Sachunsky
3ebbc2d693 return_boxes_of_images_by_order_of_reading_new: indent
(by removing unnecessary conditional)
2025-10-25 13:36:06 +02:00
Robert Sachunsky
66a0e55e49 return_boxes_of_images_by_order_of_reading_new: avoid oversplits
when y slice (`top:bot`) is not a significant part of the page,
viz. less than 22% (as in `find_number_of_columns_in_document`),
avoid forcing `find_num_col` to reach `num_col_classifier`

(allows large headers not to be split up and thus better ordered)
2025-10-25 13:35:56 +02:00
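Note on 66a0e55e49: the size test described above amounts to a simple ratio check. The sketch below only illustrates that check. The function name `is_significant_slice` and its parameters are hypothetical; only the 22% figure comes from the commit message.

```python
def is_significant_slice(top, bot, page_height, min_fraction=0.22):
    """Return True if the y slice top:bot covers at least min_fraction of the page.

    Only for significant slices would the column search be forced to
    reach the classifier's column count (assumption based on the message).
    """
    return (bot - top) >= min_fraction * page_height


# a 150 px header on a 1000 px page is not significant, so it is not force-split
print(is_significant_slice(0, 150, 1000))
print(is_significant_slice(200, 600, 1000))
```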
Robert Sachunsky
6fbb5f8a12 return_boxes_of_images_by_order_of_reading_new: simplify
- array instead of list operations
- add better plotting (but commented out)
- add more debug printing (but commented out)
- add more inline comments for documentation
- rename identifiers to make more readable:
  - `cy_hor_diff` → `y_max_hor_some` (because the ymax gets passed)
  - `lines` → `seps`
  - `y_type_2` → `y_mid`
  - `y_diff_type_2` → `y_max`
  - `y_lines_by_order` → `y_mid_by_order`
  - `y_lines_without_mother` → `y_mid_without_mother`
  - `y_lines_with_child_without_mother` → `y_mid_with_child_without_mother`
  - `y_column` → `y_mid_column`
  - `y_column_nc` → `y_mid_column_nc`
  - `y_all_between_nm_wc` → `y_mid_between_nm_wc`
  - `lines_so_close_to_top_separator` → `seps_too_close_to_top_separator`
  - `y_in_cols` and `y_down` → `y_mid_next`
- use `pairwise()` `nc_top:nc_bot` instead of `i_c` indexing
2025-10-25 13:35:44 +02:00
Robert Sachunsky
6cc5900943 find_num_col: add better plotting (but commented out) 2025-10-25 13:35:34 +02:00
Robert Sachunsky
5d15941b35 contours_in_same_horizon: simplify
- array instead of list operations
- return array of index pairs instead of list objects
2025-10-25 13:35:26 +02:00
Robert Sachunsky
acee4c1bfe find_number_of_columns_in_document: simplify 2025-10-25 13:35:18 +02:00
Robert Sachunsky
b2a79cc6ed return_x_start_end_mothers_childs_and_type_of_reading_order: fix+1
when calculating `reading_order_type`, upper limit on column range
(`x_end`) needs to be `+1` here as well
2025-10-25 13:35:12 +02:00
Robert Sachunsky
e2dfec75fb return_x_start_end_mothers_childs_and_type_of_reading_order:
simplify and document

- simplify
- rename identifiers to make readable:
  - `y_sep` → `y_mid` (because the cy gets passed)
  - `y_diff` → `y_max` (because the ymax gets passed)
- array instead of list operations
- add docstring and in-line comments
- return (zero-length) numpy array instead of empty list
2025-10-25 13:35:06 +02:00
Robert Sachunsky
0fc4b2535d return_boxes_of_images_by_order_of_reading_new: fix no-mother case
- when handling lines without mother,
  and biggest line already accounts for all columns,
  but some are too close to the top and therefore must be removed,
  avoid invalidating `biggest` index, causing `IndexError`
- remove try-catch (now unnecessary)
- array instead of list operations
2025-10-25 13:34:58 +02:00
Robert Sachunsky
7c3e418588 return_boxes_of_images_by_order_of_reading_new: simplify
- enumeration instead of indexing
- array instead of list operations
- add better plotting (but commented out)
2025-10-25 13:34:52 +02:00
Robert Sachunsky
cd35241e81 find_number_of_columns_in_document: split headings at top+baseline
regarding `splitter_y` result, for headings, instead of cutting right
through them via center line, add their toplines and baselines as if
they were horizontal separators
2025-10-25 13:34:35 +02:00
kba
ec1fd93dad wip 2025-10-23 11:58:23 +02:00
kba
883546a6b8 eynollah models package 2025-10-22 17:05:40 +02:00
kba
04bc4a63d0 reorganize model_zoo 2025-10-22 16:04:48 +02:00
kba
d94285b3ea rewrite model spec data structure 2025-10-22 13:07:35 +02:00
kba
146658f026 eynollah layout: fix trocr_processor model_zoo call 2025-10-22 10:48:26 +02:00
kba
4c8abfe19c eynollah_ocr: actually replace the model calls 2025-10-22 10:48:26 +02:00
kba
1337461d47 adopt image_enhancer to the zoo 2025-10-21 19:24:55 +02:00
kba
f0c86672f8 adopt mb_ro_on_layout to the zoo 2025-10-21 17:55:08 +02:00
kba
bcffa2e503 adopt binarizer to the zoo 2025-10-21 17:53:24 +02:00
kba
a53d5fc452 update docs/makefile to point to v0.6.0 models 2025-10-21 13:15:57 +02:00
kba
c6b863b13f typing and asserts 2025-10-21 12:05:27 +02:00
kba
44b75eb36f cli: model -> model_basedir 2025-10-21 11:05:12 +02:00
kba
062f317d2e Introduce model_zoo to Eynollah_ocr 2025-10-20 21:14:52 +02:00
kba
d609a532bf organize imports mostly 2025-10-20 19:46:07 +02:00
kba
48d1198d24 move Eynollah_ocr to separate module 2025-10-20 19:15:31 +02:00
kba
a850ef39ea factor model loading in Eynollah to EynollahModelZoo 2025-10-20 18:34:44 +02:00
Robert Sachunsky
5a0e4c3b0f find_number_of_columns_in_document: improve splitter rule
extend horizontal separators to full img width if they do not overlap
any other regions

(this only affects the returned `splitter_y` result;
 the returned separators mask is unchanged)
2025-10-20 17:41:50 +02:00
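Note on 5a0e4c3b0f: a rough sketch of the splitter rule, using bounding boxes instead of the label masks the real code presumably works on. The function name `extend_separator` and the inclusive `(x_min, x_max, y_min, y_max)` tuples are assumptions for illustration.

```python
def extend_separator(sep, regions, img_width):
    """Extend a horizontal separator to full image width if nothing is in the way.

    sep and regions use (x_min, x_max, y_min, y_max), both ends inclusive
    (hypothetical layout for this sketch).
    """
    x_min, x_max, y_min, y_max = sep
    for _, _, ry_min, ry_max in regions:
        # a full-width separator overlaps every region that shares its y band
        if ry_min <= y_max and ry_max >= y_min:
            return sep
    return (0, img_width - 1, y_min, y_max)


regions = [(0, 400, 0, 90), (500, 900, 120, 300)]
print(extend_separator((100, 300, 100, 104), regions, 1000))
print(extend_separator((100, 300, 150, 154), regions, 1000))
```

The first separator lies in a free y band and is widened; the second would cut through the right-hand region and is returned unchanged.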
Robert Sachunsky
542d38ab43 find_number_of_columns_in_document: simplify, rename lineseps 2025-10-20 17:41:49 +02:00
Robert Sachunsky
d3d599b010 order_of_regions: add better plotting (but commented out) 2025-10-20 17:41:47 +02:00
Robert Sachunsky
c43a825d1d order_of_regions: filter out-of-image peaks 2025-10-20 17:41:47 +02:00
Robert Sachunsky
48761c3e12 find_num_col: simplify, add better plotting (but commented out) 2025-10-20 17:41:45 +02:00