Commit graph

1355 commits

Author SHA1 Message Date
kba
ef999c8f0a Merge branch 'model-zoo' of lx0145.sbb.spk-berlin.de:/data/eynollah into model-zoo 2025-10-27 11:45:20 +01:00
kba
294b6356d3 wip 2025-10-27 11:45:16 +01:00
kba
51d2680d9c wip 2025-10-27 11:44:59 +01:00
Robert Sachunsky
19b2c3fa42 reading order: improve handling of headings and horizontal seps
- drop connected components analysis to test overlaps between
  horizontal separators and (horizontal) neighbours (introduced
  in ab17a927)
- instead of converting headings to topline and baseline during
  `find_number_of_columns_in_document` (introduced in 9f1595d7),
  add them to the matrix unchanged, but mark as extra type
  (besides horizontal and vertical separators)
- convert headings to toplines and baselines no earlier than in
  `return_boxes_of_images_by_order_of_reading_new`
- for both headings and horizontal separators, if they already
  span multiple columns, check if they would overlap (horizontal)
  neighbours by looking at successively larger (left and right)
  intervals of columns (and pick the largest elongation which
  does not introduce any overlaps)
2025-10-25 13:36:35 +02:00
Robert Sachunsky
3367462d18 return_boxes_of_images_by_order_of_reading_new: change arg order 2025-10-25 13:36:24 +02:00
Robert Sachunsky
a2a9fe5117 delete_separator_around: simplify, eynollah: identifiers
- use array instead of list operations
- rename identifiers:
  - `pixel` → `label`
  - `line` → `sep`
2025-10-25 13:36:17 +02:00
Robert Sachunsky
3ebbc2d693 return_boxes_of_images_by_order_of_reading_new: indent
(by removing unnecessary conditional)
2025-10-25 13:36:06 +02:00
Robert Sachunsky
66a0e55e49 return_boxes_of_images_by_order_of_reading_new: avoid oversplits
when y slice (`top:bot`) is not a significant part of the page,
viz. less than 22% (as in `find_number_of_columns_in_document`),
avoid forcing `find_num_col` to reach `num_col_classifier`

(allows large headers not to be split up and thus better ordered)
2025-10-25 13:35:56 +02:00
Robert Sachunsky
6fbb5f8a12 return_boxes_of_images_by_order_of_reading_new: simplify
- array instead of list operations
- add better plotting (but commented out)
- add more debug printing (but commented out)
- add more inline comments for documentation
- rename identifiers to make more readable:
  - `cy_hor_diff` → `y_max_hor_some` (because the ymax gets passed)
  - `lines` → `seps`
  - `y_type_2` → `y_mid`
  - `y_diff_type_2` → `y_max`
  - `y_lines_by_order` → `y_mid_by_order`
  - `y_lines_without_mother` → `y_mid_without_mother`
  - `y_lines_with_child_without_mother` → `y_mid_with_child_without_mother`
  - `y_column` → `y_mid_column`
  - `y_column_nc` → `y_mid_column_nc`
  - `y_all_between_nm_wc` → `y_mid_between_nm_wc`
  - `lines_so_close_to_top_separator` → `seps_too_close_to_top_separator`
  - `y_in_cols` and `y_down` → `y_mid_next`
- use `pairwise()` `nc_top:nc_bot` instead of `i_c` indexing
2025-10-25 13:35:44 +02:00
Robert Sachunsky
6cc5900943 find_num_col: add better plotting (but commented out) 2025-10-25 13:35:34 +02:00
Robert Sachunsky
5d15941b35 contours_in_same_horizon: simplify
- array instead of list operations
- return array of index pairs instead of list objects
2025-10-25 13:35:26 +02:00
Robert Sachunsky
acee4c1bfe find_number_of_columns_in_document: simplify 2025-10-25 13:35:18 +02:00
Robert Sachunsky
b2a79cc6ed return_x_start_end_mothers_childs_and_type_of_reading_order: fix+1
when calculating `reading_order_type`, upper limit on column range
(`x_end`) needs to be `+1` here as well
2025-10-25 13:35:12 +02:00
Robert Sachunsky
e2dfec75fb return_x_start_end_mothers_childs_and_type_of_reading_order: simplify and document

- simplify
- rename identifiers to make readable:
  - `y_sep` → `y_mid` (because the cy gets passed)
  - `y_diff` → `y_max` (because the ymax gets passed)
- array instead of list operations
- add docstring and in-line comments
- return (zero-length) numpy array instead of empty list
2025-10-25 13:35:06 +02:00
Robert Sachunsky
0fc4b2535d return_boxes_of_images_by_order_of_reading_new: fix no-mother case
- when handling lines without mother,
  and biggest line already accounts for all columns,
  but some are too close to the top and therefore must be removed,
  avoid invalidating `biggest` index, causing `IndexError`
- remove try-catch (now unnecessary)
- array instead of list operations
2025-10-25 13:34:58 +02:00
Robert Sachunsky
7c3e418588 return_boxes_of_images_by_order_of_reading_new: simplify
- enumeration instead of indexing
- array instead of list operations
- add better plotting (but commented out)
2025-10-25 13:34:52 +02:00
Robert Sachunsky
cd35241e81 find_number_of_columns_in_document: split headings at top+baseline
regarding `splitter_y` result, for headings, instead of cutting right
through them via center line, add their toplines and baselines as if
they were horizontal separators
2025-10-25 13:34:35 +02:00
vahidrezanezhad
6192e5ba5c qualitative evaluation of ocr models are added to docs 2025-10-23 16:37:24 +02:00
kba
ec1fd93dad wip 2025-10-23 11:58:23 +02:00
vahidrezanezhad
d0ad7a98b7 starting qualitative ocr evaluation 2025-10-22 22:45:22 +02:00
vahidrezanezhad
7b7714af2e completing ocr evaluations metric 2025-10-22 22:42:37 +02:00
vahidrezanezhad
b56bb44284 providing ocr model evaluation metrics 2025-10-22 21:30:06 +02:00
vahidrezanezhad
59eb4fd3be images with ro are added to readme 2025-10-22 19:04:01 +02:00
vahidrezanezhad
ab9ddd5214 OCR examples are added to README 2025-10-22 18:41:15 +02:00
vahidrezanezhad
2fc723d292 extend README 2025-10-22 18:29:14 +02:00
kba
874cfc247f . 2025-10-22 17:56:18 +02:00
kba
883546a6b8 eynollah models package 2025-10-22 17:05:40 +02:00
kba
04bc4a63d0 reorganize model_zoo 2025-10-22 16:04:48 +02:00
kba
d94285b3ea rewrite model spec data structure 2025-10-22 13:07:35 +02:00
kba
146658f026 eynollah layout: fix trocr_processor model_zoo call 2025-10-22 10:48:26 +02:00
kba
4c8abfe19c eynollah_ocr: actually replace the model calls 2025-10-22 10:48:26 +02:00
kba
1337461d47 adopt image_enhancer to the zoo 2025-10-21 19:24:55 +02:00
kba
f0c86672f8 adopt mb_ro_on_layout to the zoo 2025-10-21 17:55:08 +02:00
kba
bcffa2e503 adopt binarizer to the zoo 2025-10-21 17:53:24 +02:00
kba
de34a15809 Makefile: fix make models for OCR 2025-10-21 17:27:16 +02:00
kba
9d2b18d2af test_run: check log messages starting with eynollah 2025-10-21 13:29:55 +02:00
kba
a53d5fc452 update docs/makefile to point to v0.6.0 models 2025-10-21 13:15:57 +02:00
kba
c6b863b13f typing and asserts 2025-10-21 12:05:27 +02:00
kba
44b75eb36f cli: model -> model_basedir 2025-10-21 11:05:12 +02:00
cneud
7d70835d22 small fixes to main readme 2025-10-20 23:19:10 +02:00
cneud
230e7cc705 integrate ocrd docs 2025-10-20 22:52:54 +02:00
cneud
e5254dc6c5 integrate training docs 2025-10-20 22:39:54 +02:00
cneud
6e3399fe7a combine Docker docs 2025-10-20 22:16:56 +02:00
kba
062f317d2e Introduce model_zoo to Eynollah_ocr 2025-10-20 21:14:52 +02:00
kba
d609a532bf organize imports mostly 2025-10-20 19:46:07 +02:00
kba
48d1198d24 move Eynollah_ocr to separate module 2025-10-20 19:15:31 +02:00
kba
b90cfdfcc4 adapt tests to -l being top-level option now 2025-10-20 18:56:24 +02:00
kba
a850ef39ea factor model loading in Eynollah to EynollahModelZoo 2025-10-20 18:34:44 +02:00
Robert Sachunsky
5a0e4c3b0f find_number_of_columns_in_document: improve splitter rule
extend horizontal separators to full img width if they do not overlap
any other regions

(only as regards the returned `splitter_y` result,
 without changing the returned separators mask)
2025-10-20 17:41:50 +02:00
Robert Sachunsky
542d38ab43 find_number_of_columns_in_document: simplify, rename lineseps 2025-10-20 17:41:49 +02:00