kba
b6f82c72b9
refactor cli tests
2025-10-29 17:23:21 +01:00
cneud
22d61e8d94
remove newspaper images from main readme
2025-10-28 19:56:23 +01:00
cneud
8822da17cf
Merge remote-tracking branch 'origin/updating_docs' into docs_and_minor_fixes
2025-10-28 19:53:12 +01:00
kba
ef999c8f0a
Merge branch 'model-zoo' of lx0145.sbb.spk-berlin.de:/data/eynollah into model-zoo
2025-10-27 11:45:20 +01:00
kba
294b6356d3
wip
2025-10-27 11:45:16 +01:00
kba
51d2680d9c
wip
2025-10-27 11:44:59 +01:00
Robert Sachunsky
19b2c3fa42
reading order: improve handling of headings and horizontal seps
...
- drop connected components analysis to test overlaps between
horizontal separators and (horizontal) neighbours (introduced
in ab17a927)
- instead of converting headings to topline and baseline during
`find_number_of_columns_in_document` (introduced in 9f1595d7),
add them to the matrix unchanged, but mark as extra type
(besides horizontal and vertical separators)
- convert headings to toplines and baselines no earlier than in
`return_boxes_of_images_by_order_of_reading_new`
- for both headings and horizontal separators, if they already
span multiple columns, check if they would overlap (horizontal)
neighbours by looking at successively larger (left and right)
intervals of columns (and pick the largest elongation which
does not introduce any overlaps)
2025-10-25 13:36:35 +02:00
Robert Sachunsky
3367462d18
return_boxes_of_images_by_order_of_reading_new: change arg order
2025-10-25 13:36:24 +02:00
Robert Sachunsky
a2a9fe5117
delete_separator_around: simplify, eynollah: identifiers
...
- use array instead of list operations
- rename identifiers:
- `pixel` → `label`
- `line` → `sep`
2025-10-25 13:36:17 +02:00
Robert Sachunsky
3ebbc2d693
return_boxes_of_images_by_order_of_reading_new: indent
...
(by removing unnecessary conditional)
2025-10-25 13:36:06 +02:00
Robert Sachunsky
66a0e55e49
return_boxes_of_images_by_order_of_reading_new: avoid oversplits
...
when y slice (`top:bot`) is not a significant part of the page,
viz. less than 22% (as in `find_number_of_columns_in_document`),
avoid forcing `find_num_col` to reach `num_col_classifier`
(allows large headers not to be split up and thus better ordered)
2025-10-25 13:35:56 +02:00
Robert Sachunsky
6fbb5f8a12
return_boxes_of_images_by_order_of_reading_new: simplify
...
- array instead of list operations
- add better plotting (but commented out)
- add more debug printing (but commented out)
- add more inline comments for documentation
- rename identifiers to make more readable:
- `cy_hor_diff` → `y_max_hor_some` (because the ymax gets passed)
- `lines` → `seps`
- `y_type_2` → `y_mid`
- `y_diff_type_2` → `y_max`
- `y_lines_by_order` → `y_mid_by_order`
- `y_lines_without_mother` → `y_mid_without_mother`
- `y_lines_with_child_without_mother` → `y_mid_with_child_without_mother`
- `y_column` → `y_mid_column`
- `y_column_nc` → `y_mid_column_nc`
- `y_all_between_nm_wc` → `y_mid_between_nm_wc`
- `lines_so_close_to_top_separator` → `seps_too_close_to_top_separator`
- `y_in_cols` and `y_down` → `y_mid_next`
- use `pairwise()` `nc_top:nc_bot` instead of `i_c` indexing
2025-10-25 13:35:44 +02:00
Robert Sachunsky
6cc5900943
find_num_col: add better plotting (but commented out)
2025-10-25 13:35:34 +02:00
Robert Sachunsky
5d15941b35
contours_in_same_horizon: simplify
...
- array instead of list operations
- return array of index pairs instead of list objects
2025-10-25 13:35:26 +02:00
Robert Sachunsky
acee4c1bfe
find_number_of_columns_in_document: simplify
2025-10-25 13:35:18 +02:00
Robert Sachunsky
b2a79cc6ed
return_x_start_end_mothers_childs_and_type_of_reading_order: fix+1
...
when calculating `reading_order_type`, upper limit on column range
(`x_end`) needs to be `+1` here as well
2025-10-25 13:35:12 +02:00
Robert Sachunsky
e2dfec75fb
return_x_start_end_mothers_childs_and_type_of_reading_order:
...
simplify and document
- simplify
- rename identifiers to make readable:
- `y_sep` → `y_mid` (because the cy gets passed)
- `y_diff` → `y_max` (because the ymax gets passed)
- array instead of list operations
- add docstring and in-line comments
- return (zero-length) numpy array instead of empty list
2025-10-25 13:35:06 +02:00
Robert Sachunsky
0fc4b2535d
return_boxes_of_images_by_order_of_reading_new: fix no-mother case
...
- when handling lines without mother,
and biggest line already accounts for all columns,
but some are too close to the top and therefore must be removed,
avoid invalidating `biggest` index, causing `IndexError`
- remove try-catch (now unnecessary)
- array instead of list operations
2025-10-25 13:34:58 +02:00
Robert Sachunsky
7c3e418588
return_boxes_of_images_by_order_of_reading_new: simplify
...
- enumeration instead of indexing
- array instead of list operations
- add better plotting (but commented out)
2025-10-25 13:34:52 +02:00
Robert Sachunsky
cd35241e81
find_number_of_columns_in_document: split headings at top+baseline
...
regarding `splitter_y` result, for headings, instead of cutting right
through them via center line, add their toplines and baselines as if
they were horizontal separators
2025-10-25 13:34:35 +02:00
vahidrezanezhad
6192e5ba5c
qualitative evaluation of ocr models is added to docs
2025-10-23 16:37:24 +02:00
kba
ec1fd93dad
wip
2025-10-23 11:58:23 +02:00
vahidrezanezhad
d0ad7a98b7
starting qualitative ocr evaluation
2025-10-22 22:45:22 +02:00
vahidrezanezhad
7b7714af2e
completing ocr evaluations metric
2025-10-22 22:42:37 +02:00
vahidrezanezhad
b56bb44284
providing ocr model evaluation metrics
2025-10-22 21:30:06 +02:00
vahidrezanezhad
59eb4fd3be
images with ro are added to readme
2025-10-22 19:04:01 +02:00
vahidrezanezhad
ab9ddd5214
OCR examples are added to README
2025-10-22 18:41:15 +02:00
vahidrezanezhad
2fc723d292
extend README
2025-10-22 18:29:14 +02:00
kba
874cfc247f
.
2025-10-22 17:56:18 +02:00
kba
883546a6b8
eynollah models package
2025-10-22 17:05:40 +02:00
kba
04bc4a63d0
reorganize model_zoo
2025-10-22 16:04:48 +02:00
kba
d94285b3ea
rewrite model spec data structure
2025-10-22 13:07:35 +02:00
kba
146658f026
eynollah layout: fix trocr_processor model_zoo call
2025-10-22 10:48:26 +02:00
kba
4c8abfe19c
eynollah_ocr: actually replace the model calls
2025-10-22 10:48:26 +02:00
kba
1337461d47
adopt image_enhancer to the zoo
2025-10-21 19:24:55 +02:00
kba
f0c86672f8
adopt mb_ro_on_layout to the zoo
2025-10-21 17:55:08 +02:00
kba
bcffa2e503
adopt binarizer to the zoo
2025-10-21 17:53:24 +02:00
kba
de34a15809
Makefile: fix make models for OCR
2025-10-21 17:27:16 +02:00
kba
9d2b18d2af
test_run: check log messages starting with eynollah
2025-10-21 13:29:55 +02:00
kba
a53d5fc452
update docs/makefile to point to v0.6.0 models
2025-10-21 13:15:57 +02:00
kba
c6b863b13f
typing and asserts
2025-10-21 12:05:27 +02:00
kba
44b75eb36f
cli: model -> model_basedir
2025-10-21 11:05:12 +02:00
cneud
7d70835d22
small fixes to main readme
2025-10-20 23:19:10 +02:00
cneud
230e7cc705
integrate ocrd docs
2025-10-20 22:52:54 +02:00
cneud
e5254dc6c5
integrate training docs
2025-10-20 22:39:54 +02:00
cneud
6e3399fe7a
combine Docker docs
2025-10-20 22:16:56 +02:00
kba
062f317d2e
Introduce model_zoo to Eynollah_ocr
2025-10-20 21:14:52 +02:00
kba
d609a532bf
organize imports mostly
2025-10-20 19:46:07 +02:00
kba
48d1198d24
move Eynollah_ocr to separate module
2025-10-20 19:15:31 +02:00
kba
b90cfdfcc4
adapt tests to -l being top-level option now
2025-10-20 18:56:24 +02:00