Commit graph

1342 commits

Author SHA1 Message Date
kba
b34329dd61 tests: more path fixes 2025-11-13 12:21:48 +01:00
kba
9aeff6d155 tests: typo 2025-11-13 11:49:09 +01:00
kba
a72be69958 tests: fix model download URL 2025-11-13 11:48:23 +01:00
kba
3afbce023d tests: adapt paths 2025-11-13 11:46:31 +01:00
vahidrezanezhad
ed5b5c13dd Add test images; call TrOCR processor from the same directory as the TrOCR model 2025-11-07 12:47:21 +01:00
kba
8732007aaf . 2025-11-06 16:33:39 +01:00
kba
f902756ce1 try importing torch, then shapely, then tensorflow 2025-11-06 13:10:35 +01:00
kba
44037bc05d add layout marginalia test 2025-11-06 12:42:57 +01:00
kba
d224b0f7e8 try with shapely.set_precision(...mode="keep_collapsed") 2025-11-06 11:55:40 +01:00
kba
0d84e7da16 Merge remote-tracking branch 'origin/docs_and_minor_fixes' into model-zoo
# Conflicts:
#	README.md
#	train/README.md
2025-11-06 11:37:10 +01:00
kba
53e879e289 make *test: another typo; 2025-11-05 16:19:55 +01:00
kba
e449dbab6d make *test: fix paths 2025-11-05 15:28:41 +01:00
kba
0bef6e297b make models: unzip to the versioned directory 2025-11-05 15:19:16 +01:00
kba
2c211095d7 make deps-test should not depend on the models 2025-11-05 15:02:55 +01:00
kba
b6c7283b4d further debugging 2025-11-05 14:41:18 +01:00
cneud
f90259d6e2 fix docs links 2025-10-30 22:24:54 +01:00
cneud
d5b7089bad Merge branch 'docs_and_minor_fixes' of https://github.com/qurator-spk/eynollah into docs_and_minor_fixes 2025-10-30 22:17:41 +01:00
cneud
9dbac280cc Revert "remove unnecessary backslash"
This reverts commit f212ffa22d.
2025-10-30 22:16:53 +01:00
cneud
2d35a0598d Revert "replace list declaration with list literal (faster)"
This reverts commit 9733d575bf.
2025-10-30 22:16:48 +01:00
cneud
70d8577a15 Revert "remove redundant parentheses"
This reverts commit 20a95365c2.
2025-10-30 22:16:41 +01:00
Clemens Neudecker
c9efbe1871 refactor image layout in examples.md 2025-10-30 16:52:59 +01:00
kba
8782ef17b2 CI: 🔥 upgrade torch for debugging 2025-10-30 12:19:35 +01:00
kba
62d05917c5 test_layout: str(Path) 2025-10-30 12:17:38 +01:00
cneud
b1e191b2ea reformat cli options table 2025-10-29 22:30:58 +01:00
cneud
f6c0f56348 Update README.md 2025-10-29 22:23:56 +01:00
cneud
46a45f6b0e Create examples.md 2025-10-29 22:23:48 +01:00
kba
15e6ecb95d make models: update URL 2025-10-29 21:27:10 +01:00
kba
600ebfeb50 make: fix to use single-archive ZIP 2025-10-29 21:07:49 +01:00
kba
9ab565fa02 model basedir might be a symlink 2025-10-29 21:02:42 +01:00
kba
4772fd17e2 missed changing override mechanism in eynollah_ocr 2025-10-29 20:47:13 +01:00
kba
29c273685f fix merge issues 2025-10-29 20:15:19 +01:00
kba
de76eabc1d Merge branch 'cli-logging' into model-zoo 2025-10-29 19:41:01 +01:00
kba
5e22e9db64 model_zoo: make type str to reduce importing overhead 2025-10-29 19:16:35 +01:00
kba
a913bdf7dc make --model-basedir and --model-overrides top-level CLI options 2025-10-29 18:48:41 +01:00
kba
b6f82c72b9 refactor cli tests 2025-10-29 17:23:21 +01:00
cneud
22d61e8d94 remove newspaper images from main readme 2025-10-28 19:56:23 +01:00
cneud
8822da17cf Merge remote-tracking branch 'origin/updating_docs' into docs_and_minor_fixes 2025-10-28 19:53:12 +01:00
kba
ef999c8f0a Merge branch 'model-zoo' of lx0145.sbb.spk-berlin.de:/data/eynollah into model-zoo 2025-10-27 11:45:20 +01:00
kba
294b6356d3 wip 2025-10-27 11:45:16 +01:00
kba
51d2680d9c wip 2025-10-27 11:44:59 +01:00
Robert Sachunsky
19b2c3fa42 reading order: improve handling of headings and horizontal seps
- drop connected components analysis to test overlaps between
  horizontal separators and (horizontal) neighbours (introduced
  in ab17a927)
- instead of converting headings to topline and baseline during
  `find_number_of_columns_in_document` (introduced in 9f1595d7),
  add them to the matrix unchanged, but mark as extra type
  (besides horizontal and vertical separators)
- convert headings to toplines and baselines no earlier than in
  `return_boxes_of_images_by_order_of_reading_new`
- for both headings and horizontal separators, if they already
  span multiple columns, check if they would overlap (horizontal)
  neighbours by looking at successively larger (left and right)
  intervals of columns (and pick the largest elongation which
  does not introduce any overlaps)
2025-10-25 13:36:35 +02:00
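The last point of the commit message above (trying successively larger column intervals and keeping the largest one that introduces no overlap) can be sketched as follows. This is not the code from the commit: the function name `widest_free_span`, the `occupied` set of column indices and the one-column-per-step growth are assumptions made purely for illustration.

```python
def widest_free_span(start, end, num_cols, occupied):
    """Grow the half-open column interval [start, end) to the left and right.

    `occupied` is a hypothetical set of column indices already taken by
    horizontal neighbours at the same height. The interval is extended one
    column per side and step; the largest interval reached without touching
    an occupied column is returned.
    """
    left, right = start, end
    best = (left, right)
    while True:
        grew = False
        # extend one column to the left if that column is free
        if left > 0 and (left - 1) not in occupied:
            left -= 1
            grew = True
        # extend one column to the right if that column is free
        if right < num_cols and right not in occupied:
            right += 1
            grew = True
        if not grew:
            return best
        best = (left, right)


# a separator spanning columns 2..3 on a 6-column page, column 0 is taken
print(widest_free_span(2, 4, 6, {0}))
```

A separator whose neighbours block both sides keeps its original span, because neither side can grow.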
Robert Sachunsky
3367462d18 return_boxes_of_images_by_order_of_reading_new: change arg order 2025-10-25 13:36:24 +02:00
Robert Sachunsky
a2a9fe5117 delete_separator_around: simplify, eynollah: identifiers
- use array instead of list operations
- rename identifiers:
  - `pixel` → `label`
  - `line` → `sep`
2025-10-25 13:36:17 +02:00
Robert Sachunsky
3ebbc2d693 return_boxes_of_images_by_order_of_reading_new: indent
(by removing unnecessary conditional)
2025-10-25 13:36:06 +02:00
Robert Sachunsky
66a0e55e49 return_boxes_of_images_by_order_of_reading_new: avoid oversplits
when y slice (`top:bot`) is not a significant part of the page,
viz. less than 22% (as in `find_number_of_columns_in_document`),
avoid forcing `find_num_col` to reach `num_col_classifier`

(allows large headers not to be split up and thus better ordered)
2025-10-25 13:35:56 +02:00
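The significance test described in the commit message above might look roughly like this. Only the 22% figure and the `top:bot` slice come from the message; the function name `is_significant_slice` and its parameters are hypothetical, and the real check inside `return_boxes_of_images_by_order_of_reading_new` may be written differently.

```python
def is_significant_slice(top, bot, page_height, min_fraction=0.22):
    """Return True if the y slice top:bot covers at least `min_fraction`
    of the page height (0.22 is the threshold named in the commit message)."""
    if page_height <= 0:
        raise ValueError("page_height must be positive")
    return (bot - top) / page_height >= min_fraction


# a 100 px band on a 1000 px page is not significant, a 500 px band is
print(is_significant_slice(0, 100, 1000), is_significant_slice(0, 500, 1000))
```

Under this reading, a caller would skip forcing the classifier's column count whenever the function returns False.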
Robert Sachunsky
6fbb5f8a12 return_boxes_of_images_by_order_of_reading_new: simplify
- array instead of list operations
- add better plotting (but commented out)
- add more debug printing (but commented out)
- add more inline comments for documentation
- rename identifiers to make more readable:
  - `cy_hor_diff` → `y_max_hor_some` (because the ymax gets passed)
  - `lines` → `seps`
  - `y_type_2` → `y_mid`
  - `y_diff_type_2` → `y_max`
  - `y_lines_by_order` → `y_mid_by_order`
  - `y_lines_without_mother` → `y_mid_without_mother`
  - `y_lines_with_child_without_mother` → `y_mid_with_child_without_mother`
  - `y_column` → `y_mid_column`
  - `y_column_nc` → `y_mid_column_nc`
  - `y_all_between_nm_wc` → `y_mid_between_nm_wc`
  - `lines_so_close_to_top_separator` → `seps_too_close_to_top_separator`
  - `y_in_cols` and `y_down` → `y_mid_next`
- use `pairwise()` `nc_top:nc_bot` instead of `i_c` indexing
2025-10-25 13:35:44 +02:00
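The switch from index-based iteration to `pairwise()` mentioned in the last bullet above can be illustrated in isolation. The helper names and the `boundaries` list are assumptions for the sake of the example and are not taken from the commit; the fallback covers Python versions before 3.10, where `itertools.pairwise` is not available.

```python
try:
    from itertools import pairwise  # available from Python 3.10
except ImportError:
    def pairwise(iterable):
        """Fallback: yield consecutive overlapping pairs."""
        items = list(iterable)
        return zip(items, items[1:])


def slices_by_index(boundaries):
    """Index-based variant: pair each boundary with its successor."""
    return [(boundaries[i_c], boundaries[i_c + 1])
            for i_c in range(len(boundaries) - 1)]


def slices_pairwise(boundaries):
    """Same result without explicit index arithmetic."""
    return [(nc_top, nc_bot) for nc_top, nc_bot in pairwise(boundaries)]


boundaries = [0, 3, 7, 10]  # hypothetical boundary values
print(slices_pairwise(boundaries))
```

Both variants produce the same pairs; the `pairwise()` form avoids off-by-one mistakes in the index range.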
Robert Sachunsky
6cc5900943 find_num_col: add better plotting (but commented out) 2025-10-25 13:35:34 +02:00
Robert Sachunsky
5d15941b35 contours_in_same_horizon: simplify
- array instead of list operations
- return array of index pairs instead of list objects
2025-10-25 13:35:26 +02:00
Robert Sachunsky
acee4c1bfe find_number_of_columns_in_document: simplify 2025-10-25 13:35:18 +02:00
Robert Sachunsky
b2a79cc6ed return_x_start_end_mothers_childs_and_type_of_reading_order: fix+1
when calculating `reading_order_type`, upper limit on column range
(`x_end`) needs to be `+1` here as well
2025-10-25 13:35:12 +02:00