1156 commits

Author SHA1 Message Date
kba
733af1e9a7 📝 update train/README.md, align with docs/train.md 2025-10-01 17:43:32 +02:00
vahidrezanezhad
5725e4fd1f Continue processing when num_col is None but textregions exist; convert marginal-only to main body if no main body is present; reset deskew angle to 0 when text region density (textregion area to page area) < 0.3 and angle > 45° 2025-10-01 15:58:03 +02:00
cneud
4514d417a7 force GH markdown code block in list 2025-10-01 01:16:25 +02:00
cneud
e027bc038e Update README.md 2025-10-01 01:05:15 +02:00
cneud
91d2a74ac9 remove redundant parentheses 2025-10-01 00:38:01 +02:00
cneud
f2f93e0251 list literal is faster than using list constructor to create a new list 2025-10-01 00:26:27 +02:00
cneud
70af00182b mutable defaults are the source of all evil 2025-10-01 00:20:18 +02:00
cneud
1d0616eb69 comparisons to None should not use the equality operators 2025-10-01 00:15:11 +02:00
cneud
9ce127eb51 remove unnecessary backslash 2025-10-01 00:04:53 +02:00
cneud
558867eb24 fix typo 2025-10-01 00:04:07 +02:00
Robert Sachunsky
3aa7ad04fa 📝 update changelog 2025-09-30 23:14:52 +02:00
Robert Sachunsky
f0de1adabf rm loky dependency 2025-09-30 23:12:18 +02:00
Robert Sachunsky
7daec392b9 Dockerfile: fix up CUDA installation for mixed TF/Torch 2025-09-30 22:10:45 +02:00
Robert Sachunsky
ad129ed46c CI: remove OS from model cache keys 2025-09-30 22:05:53 +02:00
Robert Sachunsky
c86e59f481 CI: update model key, split up cache restore/save 2025-09-30 22:03:46 +02:00
Robert Sachunsky
a3d8197930 makefile: update model URL 2025-09-30 21:50:21 +02:00
Robert Sachunsky
61b20cc83d tests: switch from subtests to parametrize, use --isolate everywhere to free CUDA memory in between 2025-09-30 19:20:35 +02:00
Robert Sachunsky
375e0263d4 CNN-RNN OCR model: switch to 20250930 version (compatible with TF 2.12 on CPU as well) 2025-09-30 19:16:50 +02:00
Robert Sachunsky
b21051db21 ProcessPoolExecutor: shutdown during del() instead of atexit() 2025-09-30 19:16:00 +02:00
Robert Sachunsky
08c8c26028 indent extremely long lines 2025-09-30 03:52:19 +02:00
Robert Sachunsky
f857ee7b51 simplify 2025-09-30 02:26:00 +02:00
Robert Sachunsky
c0137c29ad try to fix the failed outsourcing of utils_ocr 2025-09-30 02:23:43 +02:00
Robert Sachunsky
13f85b0d5c Merge branch 'main' into loky-with-shm-for-175-rebuilt 2025-09-30 02:07:20 +02:00
cneud
070dafca75 remove duplicate LICENSE 2025-09-29 22:17:27 +02:00
cneud
53c1ca11fc Update README.md 2025-09-29 22:15:17 +02:00
Robert Sachunsky
758602403e replace loky with concurrent.futures.ProcessPoolExecutor (faster) 2025-09-29 17:48:22 +02:00
Robert Sachunsky
0366707136 get_smallest_skew: do not pass logger 2025-09-29 17:48:22 +02:00
Robert Sachunsky
b94c96fcbb find_num_col: exit early if empty (avoiding exceptions) 2025-09-29 17:48:22 +02:00
Robert Sachunsky
04c3d7dd1b get_smallest_skew: avoid shm if no ProcessPoolExecutor is passed 2025-09-29 17:48:22 +02:00
Robert Sachunsky
0662ece536 do_work_of_slopes*: use shm also in non-light mode(s) 2025-09-29 17:48:22 +02:00
Robert Sachunsky
31f240c3b8 do_image_rotation, do_work_of_slopes_new_curved: pass arrays via shared memory 2025-09-29 17:48:22 +02:00
Robert Sachunsky
8be2c79771 Revert "deskewing with faster multiprocessing" (this reverts commit 5db3e9fa64) 2025-09-29 17:48:22 +02:00
Robert Sachunsky
abf5c0f845 get_smallest_skew: when shifting search range of rotation angle, compare resulting (maximum) variances instead of blindly assuming the new range is better 2025-09-29 17:48:22 +02:00
Robert Sachunsky
dc0caad512 writer: use @type='heading' instead of 'header' 2025-09-29 17:48:22 +02:00
Robert Sachunsky
f458e3ece0 writer: SeparatorRegion needs SeparatorRegionType (not ImageRegionType) 2025-09-29 17:48:22 +02:00
Robert Sachunsky
4337d62985 contours: rename 'pixel' → 'label' for clarity 2025-09-29 17:48:22 +02:00
Robert Sachunsky
5b16c2fc00 avoid pulling unused 'image_page_rotated' through functions 2025-09-29 17:48:22 +02:00
Robert Sachunsky
5bff2d156a use box2rect instead of crop_image_inside_box when no image needed 2025-09-29 17:48:22 +02:00
Robert Sachunsky
9b5182c1c0 utils: introduce box2rect and box2slice 2025-09-29 17:48:19 +02:00
Robert Sachunsky
bca2ae3d78 get_marginals: exit early if no peaks found to avoid spurious overlap mask 2025-09-29 17:47:51 +02:00
Robert Sachunsky
235539a350 filter_contours_without_textline_inside: avoid removing from identical lists twice 2025-09-29 17:47:51 +02:00
Robert Sachunsky
11e143afee polygon2contour: avoid overflow 2025-09-29 17:47:51 +02:00
Robert Sachunsky
7a9e8256ee increase dilatation: textregions/lines (5→6), seplines (0→1) 2025-09-29 17:47:51 +02:00
Robert Sachunsky
f3faa29528 refactor shapely conversions into contour2polygon / polygon2contour, also handle heterogeneous geometries 2025-09-29 17:47:51 +02:00
Robert Sachunsky
0650274ffa move dilate_*_contours to .utils.contour, rename dilate_textregions_contours_textline_version → dilate_textline_contours 2025-09-29 17:47:47 +02:00
Robert Sachunsky
a433c73628 filter_contours_area_of_image*: also ensure validity here 2025-09-29 17:46:50 +02:00
Robert Sachunsky
17bcf1af71 rename *lines_xml → *seplines for clarity 2025-09-29 17:46:50 +02:00
Robert Sachunsky
e730725da3 check_any_text_region_in_model_one_is_main_or_header_light: return original instead of resampled contours 2025-09-29 17:46:50 +02:00
Robert Sachunsky
7b51fd6624 avoid creating invalid polygons via rounding 2025-09-29 17:46:50 +02:00
Robert Sachunsky
41cc38c51a get_textregion_contours_in_org_image_light: no back rotation, drop slope_first (always 0) 2025-09-29 17:46:48 +02:00