Commit graph

1351 commits

Author SHA1 Message Date
Robert Sachunsky
3367462d18 return_boxes_of_images_by_order_of_reading_new: change arg order 2025-10-25 13:36:24 +02:00
Robert Sachunsky
a2a9fe5117 delete_separator_around: simplify, eynollah: identifiers
- use array instead of list operations
- rename identifiers:
  - `pixel` → `label`
  - `line` → `sep`
2025-10-25 13:36:17 +02:00
Robert Sachunsky
3ebbc2d693 return_boxes_of_images_by_order_of_reading_new: indent
(by removing unnecessary conditional)
2025-10-25 13:36:06 +02:00
Robert Sachunsky
66a0e55e49 return_boxes_of_images_by_order_of_reading_new: avoid oversplits
when y slice (`top:bot`) is not a significant part of the page,
viz. less than 22% (as in `find_number_of_columns_in_document`),
avoid forcing `find_num_col` to reach `num_col_classifier`

(allows large headers not to be split up and thus better ordered)
2025-10-25 13:35:56 +02:00
Robert Sachunsky
6fbb5f8a12 return_boxes_of_images_by_order_of_reading_new: simplify
- array instead of list operations
- add better plotting (but commented out)
- add more debug printing (but commented out)
- add more inline comments for documentation
- rename identifiers to make more readable:
  - `cy_hor_diff` → `y_max_hor_some` (because the ymax gets passed)
  - `lines` → `seps`
  - `y_type_2` → `y_mid`
  - `y_diff_type_2` → `y_max`
  - `y_lines_by_order` → `y_mid_by_order`
  - `y_lines_without_mother` → `y_mid_without_mother`
  - `y_lines_with_child_without_mother` → `y_mid_with_child_without_mother`
  - `y_column` → `y_mid_column`
  - `y_column_nc` → `y_mid_column_nc`
  - `y_all_between_nm_wc` → `y_mid_between_nm_wc`
  - `lines_so_close_to_top_separator` → `seps_too_close_to_top_separator`
  - `y_in_cols` and `y_down` → `y_mid_next`
- use `pairwise()` `nc_top:nc_bot` instead of `i_c` indexing
2025-10-25 13:35:44 +02:00
Robert Sachunsky
6cc5900943 find_num_col: add better plotting (but commented out) 2025-10-25 13:35:34 +02:00
Robert Sachunsky
5d15941b35 contours_in_same_horizon: simplify
- array instead of list operations
- return array of index pairs instead of list objects
2025-10-25 13:35:26 +02:00
Robert Sachunsky
acee4c1bfe find_number_of_columns_in_document: simplify 2025-10-25 13:35:18 +02:00
Robert Sachunsky
b2a79cc6ed return_x_start_end_mothers_childs_and_type_of_reading_order: fix+1
when calculating `reading_order_type`, upper limit on column range
(`x_end`) needs to be `+1` here as well
2025-10-25 13:35:12 +02:00
Robert Sachunsky
e2dfec75fb return_x_start_end_mothers_childs_and_type_of_reading_order:
simplify and document

- simplify
- rename identifiers to make readable:
  - `y_sep` → `y_mid` (because the cy gets passed)
  - `y_diff` → `y_max` (because the ymax gets passed)
- array instead of list operations
- add docstring and in-line comments
- return (zero-length) numpy array instead of empty list
2025-10-25 13:35:06 +02:00
Robert Sachunsky
0fc4b2535d return_boxes_of_images_by_order_of_reading_new: fix no-mother case
- when handling lines without mother,
  and biggest line already accounts for all columns,
  but some are too close to the top and therefore must be removed,
  avoid invalidating `biggest` index, causing `IndexError`
- remove try-catch (now unnecessary)
- array instead of list operations
2025-10-25 13:34:58 +02:00
Robert Sachunsky
7c3e418588 return_boxes_of_images_by_order_of_reading_new: simplify
- enumeration instead of indexing
- array instead of list operations
- add better plotting (but commented out)
2025-10-25 13:34:52 +02:00
Robert Sachunsky
cd35241e81 find_number_of_columns_in_document: split headings at top+baseline
regarding `splitter_y` result, for headings, instead of cutting right
through them via center line, add their toplines and baselines as if
they were horizontal separators
2025-10-25 13:34:35 +02:00
vahidrezanezhad
6192e5ba5c qualitative evaluation of ocr models are added to docs 2025-10-23 16:37:24 +02:00
kba
ec1fd93dad wip 2025-10-23 11:58:23 +02:00
vahidrezanezhad
d0ad7a98b7 starting qualitative ocr evaluation 2025-10-22 22:45:22 +02:00
vahidrezanezhad
7b7714af2e completing ocr evaluations metric 2025-10-22 22:42:37 +02:00
vahidrezanezhad
b56bb44284 providing ocr model evaluation metrics 2025-10-22 21:30:06 +02:00
vahidrezanezhad
59eb4fd3be images with ro are added to readme 2025-10-22 19:04:01 +02:00
vahidrezanezhad
ab9ddd5214 OCR examples are added to README 2025-10-22 18:41:15 +02:00
vahidrezanezhad
2fc723d292 extend README 2025-10-22 18:29:14 +02:00
kba
874cfc247f . 2025-10-22 17:56:18 +02:00
kba
883546a6b8 eynollah models package 2025-10-22 17:05:40 +02:00
kba
04bc4a63d0 reorganize model_zoo 2025-10-22 16:04:48 +02:00
kba
d94285b3ea rewrite model spec data structure 2025-10-22 13:07:35 +02:00
kba
146658f026 eynollah layout: fix trocr_processor model_zoo call 2025-10-22 10:48:26 +02:00
kba
4c8abfe19c eynollah_ocr: actually replace the model calls 2025-10-22 10:48:26 +02:00
kba
1337461d47 adopt image_enhancer to the zoo 2025-10-21 19:24:55 +02:00
kba
f0c86672f8 adopt mb_ro_on_layout to the zoo 2025-10-21 17:55:08 +02:00
kba
bcffa2e503 adopt binarizer to the zoo 2025-10-21 17:53:24 +02:00
kba
de34a15809 Makefile: fix make models for OCR 2025-10-21 17:27:16 +02:00
kba
9d2b18d2af test_run: check log messages starting with eynollah 2025-10-21 13:29:55 +02:00
kba
a53d5fc452 update docs/makefile to point to v0.6.0 models 2025-10-21 13:15:57 +02:00
kba
c6b863b13f typing and asserts 2025-10-21 12:05:27 +02:00
kba
44b75eb36f cli: model -> model_basedir 2025-10-21 11:05:12 +02:00
cneud
7d70835d22 small fixes to main readme 2025-10-20 23:19:10 +02:00
cneud
230e7cc705 integrate ocrd docs 2025-10-20 22:52:54 +02:00
cneud
e5254dc6c5 integrate training docs 2025-10-20 22:39:54 +02:00
cneud
6e3399fe7a combine Docker docs 2025-10-20 22:16:56 +02:00
kba
062f317d2e Introduce model_zoo to Eynollah_ocr 2025-10-20 21:14:52 +02:00
kba
d609a532bf organize imports mostly 2025-10-20 19:46:07 +02:00
kba
48d1198d24 move Eynollah_ocr to separate module 2025-10-20 19:15:31 +02:00
kba
b90cfdfcc4 adapt tests to -l being top-level option now 2025-10-20 18:56:24 +02:00
kba
a850ef39ea factor model loading in Eynollah to EynollahModelZoo 2025-10-20 18:34:44 +02:00
Robert Sachunsky
5a0e4c3b0f find_number_of_columns_in_document: improve splitter rule
extend horizontal separators to full img width if they do not overlap
any other regions

(only as regards to returned `splitter_y` result,
 but without changing returned separators mask)
2025-10-20 17:41:50 +02:00
Robert Sachunsky
542d38ab43 find_number_of_columns_in_document: simplify, rename lineseps 2025-10-20 17:41:49 +02:00
Robert Sachunsky
d3d599b010 order_of_regions: add better plotting (but commented out) 2025-10-20 17:41:47 +02:00
Robert Sachunsky
c43a825d1d order_of_regions: filter out-of-image peaks 2025-10-20 17:41:47 +02:00
Robert Sachunsky
48761c3e12 find_num_col: simplify, add better plotting (but commented out) 2025-10-20 17:41:45 +02:00
Robert Sachunsky
184927fb54 find_num_cols: re-sort peaks when cutting n-best num_col_classifier 2025-10-20 17:41:44 +02:00