eynollah

mirror of https://github.com/qurator-spk/eynollah.git synced 2026-02-21 00:41:56 +01:00

Author	SHA1	Message	Date
vahidrezanezhad	f9695cd7be	Merge branch 'adding-cnn-rnn-training-script' of https://github.com/qurator-spk/eynollah into adding-cnn-rnn-training-script	2026-01-28 11:52:36 +01:00
vahidrezanezhad	3500167870	weights ensembling for tensorflow models is integrated	2026-01-28 11:52:12 +01:00
vahidrezanezhad	33f6a231bc	fix: prevent crash when printspace is missing in xmls used for label generation	2026-01-26 17:30:26 +01:00
vahidrezanezhad	6ae244bf9b	Fix filename stem extraction using binarization. Restore the CNN-RNN model to its previous version, as setting channels_last alone was insufficient for running on both CPU and GPU. Prevent errors caused by null values in image shape elements.	2026-01-26 15:04:47 +01:00
vahidrezanezhad	30f39e7383	mapregion is added to labels	2026-01-26 13:56:34 +01:00
vahidrezanezhad	c8240905a8	Fix label generation by selecting largest contour when erosion splits shapes	2026-01-26 13:36:24 +01:00
Robert Sachunsky	3c3effcfda	drop TF1 vernacular, relax TF/Keras and Torch requirements… - do not restrict TF version, but depend on tf-keras and set `TF_USE_LEGACY_KERAS=1` to avoid Keras 3 behaviour - relax Numpy version requirement up to v2 - relax Torch version requirement - drop TF1 session management code - drop TF1 config in favour of TF2 config code for memory growth - training.*: also simplify and limit line length - training.train: always train with TensorBoard callback	2026-01-20 11:34:02 +01:00
Robert Sachunsky	e2754da4f5	adapt to Numpy 1.25 changes… (esp. `np.array(...)` now not allowed on ragged arrays unless `dtype=object`, but then coercing sub-arrays to `object` as well)	2026-01-20 04:04:07 +01:00
kba	9ccc495b4a	wip	2025-12-19 14:57:10 +01:00
vahidrezanezhad	49261fa99b	CNN–RNN–OCR inference and adaptation of the CNN–RNN–OCR model to support inference on both CPU and GPU	2025-12-17 15:12:39 +01:00
vahidrezanezhad	6ee79c7320	evaluation with a given GT is only possible for segmentation tasks	2025-12-17 13:28:02 +01:00
vahidrezanezhad	4651000191	debugging input shape + enable finetuning a model	2025-12-15 11:36:09 +01:00
vahidrezanezhad	4fc3ff33cb	The cnn-rnn ocr model can be trained now	2025-12-09 17:22:12 +01:00
vahidrezanezhad	84a72a128b	cnn-rnn model can be called - model input height and width are dynamic now - data generator is also callable	2025-12-09 15:30:19 +01:00
vahidrezanezhad	59e5a73654	adding cnn-rnn training script	2025-12-08 19:30:57 +01:00
vahidrezanezhad	7bf5e077d9	Restore correct execution of export_textline_images_and_text	2025-12-03 15:40:52 +01:00
vahidrezanezhad	6ac37af2f8	Fix eynollah ocr --help so it works again	2025-12-03 14:11:47 +01:00
vahidrezanezhad	d687d862d6	Restored correct functionality of the extract_only_images mode and cleaned up the argument handling	2025-12-03 12:01:42 +01:00
Robert Sachunsky	9fdae72e96	utils_ocr.return_textline_contour: gen cv2-like contours (w/ ndim=3, as in all other places)	2025-12-03 03:04:46 +01:00
Robert Sachunsky	ad8f8167c2	separate_lines/_vertical: gen cv2-like contours (w/ ndim=3, as in all other places)	2025-12-03 00:58:26 +01:00
Robert Sachunsky	43a95842bd	writer: also ensure validity after scaling	2025-12-02 16:35:32 +01:00
kba	51abe9617a	log to STDERR not STDOUT	2025-12-02 15:00:33 +01:00
Robert Sachunsky	56e73bf72f	deskewing: add a 2nd stage for precision after selecting the optimum angle on the original search range, narrow down around in the vicinity with half the range (adding computational costs, but gaining precision)	2025-11-28 18:27:58 +01:00
Robert Sachunsky	adcea47bc0	return_boxes_of_images_by_order_of_reading_new: always erode when passing the text region mask, do not apply erosion only if there are more than 2 columns, but iff `not erosion_hurts` (consistent with `find_num_col`'s expectations and making it as easy to find the column gaps on 1 and 2-column pages as on multi-column pages)	2025-11-28 18:23:59 +01:00
Robert Sachunsky	5a3de3b42d	column detection: improve, aided by vseps whenever possible - `find_number_of_columns_in_document`: retain vertical separators and pass to `find_num_col` for each vertical split - `return_boxes_of_images_by_order_of_reading_new`: reconstruct the vertical separators from the segmentation mask and the separator bboxes; pass it on to `find_num_col` everywhere - `return_boxes_of_images_by_order_of_reading_new`: no need to try-catch `find_num_col` anymore - `return_boxes_of_images_by_order_of_reading_new`: when a vertical split has too few columns, * do not raise but lower the threshold `multiplier` responsible for allowing gaps as column boundaries * do not pass the `num_col_classifier` (i.e. expected number of resulting columns) of the entire page to the iterative `find_num_col` for each existing column, but only the portion of that span	2025-11-28 18:14:24 +01:00
Robert Sachunsky	4dd40c542b	find_num_col: add optional criterion - sum of vertical separators when searching for gaps between text regions, consider the vertical separator mask (if given): add the vertical sum of vertical separators to the peak scores (making column detection more robust if still slightly skewed or partially obscured by multi-column regions, but fg seps are present)	2025-11-28 18:07:15 +01:00
Robert Sachunsky	84d10962f3	return_boxes_of_images_by_order_of_reading_new: improve - when searching for multi-col box makers, pick the right-most allowable column, not the left-most	2025-11-28 18:04:12 +01:00
Robert Sachunsky	5abf0c1097	return_boxes_of_images_by_order_of_reading_new: improve - when analysing regions spanning across columns, disregard tiny regions (smaller than half the median size) - if a region spans across columns just by a tiny fraction, and therefore is not good enough for a multi-col separator, then it should also not be good enough for a multi-col box maker	2025-11-28 17:58:44 +01:00
Robert Sachunsky	b71bb80e3a	return_boxes_of_images_by_order_of_reading_new: fix `4abc2ff5` (forgot to also flip `regions_with_separators` if right2left)	2025-11-28 17:57:10 +01:00
Robert Sachunsky	a527d7a10d	combine_hor_lines_and_delete_cross_points: improve - avoid unnecessary `fillPoly` (we already have the mask) - do not merge hseps if vseps interfere - remove old criterion (based on total length of hseps) - create new criterion (no x overlap and x close to each other) - rename identifiers: * `sum_dis` → `sum_xspan` * `diff_max_min_uniques` → `tot_xspan` * np.std / np.mean → `dev_xspan` - remove rule cutting around the center of crossing seps (which is unnecessary and creates small isolated seps at the center, unrelated to the actual crossing points) - create rule cutting hseps by vseps _prior_ to merging	2025-11-28 17:34:11 +01:00
Robert Sachunsky	5c12b6a851	combine_hor_lines_and_delete_cross_points: simplify and rename - `x_width_smaller_than_acolumn_width` → `avg_col_width` - `len_lines_bigger_than_x_width_smaller_than_acolumn_width` → `nseps_wider_than_than_avg_col_width` - `img_in_hor` → `img_p_in_hor` (analogous to vertical)	2025-11-28 17:27:12 +01:00
Robert Sachunsky	06cb9d1d31	combine_hor_lines_and_delete_cross_points: fix 1-off px bug when eroding the vertical separator mask (by slicing), avoid leaving 1px strips	2025-11-28 17:08:39 +01:00
Robert Sachunsky	38d91673b1	combine_hor_lines_and_delete_cross_points: get external contours instead of tree without looking at the actual hierarchy (to prevent retrieving holes as separators)	2025-11-28 16:50:08 +01:00
Robert Sachunsky	ee59a6809d	contours_in_same_horizon: fix `5d15941b`	2025-11-28 16:17:09 +01:00
kba	b161e33854	🔥 refactor eynollah ocr .	2025-11-28 15:45:21 +01:00
kba	30f9c695dc	move line-gt extraction out of ocr to eynollah-training	2025-11-28 15:12:31 +01:00
kba	951bd2fce6	CI: do not upgrade (now-unpinned) torch	2025-11-28 15:12:31 +01:00
kba	9bcfeab057	💀 remove dead code from eynollah.py	2025-11-28 12:52:28 +01:00
kba	5171e09c2d	eynollah.py: fix kwargs to writer	2025-11-28 12:52:28 +01:00
kba	c24cf94bce	enforce kwargs for writer.build_...	2025-11-28 12:52:28 +01:00
kba	4aa9543a7d	remove more branches after textline_light default true	2025-11-27 11:30:00 +01:00
kba	177d555ded	factor out extract_only_images as eynollah extract-images	2025-11-26 21:37:00 +01:00
kba	83e8b289da	🔥 drop light_version/textline_light (now default and implied)	2025-11-26 20:48:22 +01:00
kba	ca83cf934d	fix imports from src/cli/cli_/_cli	2025-11-26 20:48:14 +01:00
kba	095b36c389	models: split into layout, extra and ocr layout: Everything not OCR or extra ocr: trocr/cnnrnn models extra: obsolete or niche models	2025-11-26 19:49:59 +01:00
kba	000af16a47	🔥 remove torch pinning	2025-11-26 19:23:49 +01:00
kba	e503c1a0b7	drop obsolete multi-model binarization	2025-11-26 18:51:41 +01:00
kba	82266f8234	reorganize cli	2025-11-26 18:51:20 +01:00
kba	5a1900e664	🔥 remove OCR option from eynollah layout	2025-11-26 18:12:03 +01:00
kba	0f410c2e7c	disable tf/keras logging on first import	2025-11-26 16:37:54 +01:00
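
The two-stage deskewing search described in commit 56e73bf72f (pick the optimum angle on the original search range, then search again around it with half the range) can be illustrated with a minimal sketch. This is not eynollah's code: the function names, the default range and step count, and the toy score function are all hypothetical, and "half the range" is read here as a second window half as wide as the first, centred on the coarse optimum.

```python
from typing import Callable, List


def linspace(lo: float, hi: float, n: int) -> List[float]:
    """Return n evenly spaced values from lo to hi inclusive (n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def two_stage_angle_search(
    score: Callable[[float], float],  # hypothetical: higher means better aligned
    lo: float = -10.0,
    hi: float = 10.0,
    steps: int = 21,
) -> float:
    """Coarse search over [lo, hi], then a finer search around the optimum."""
    # Stage 1: evaluate the score on the original search range.
    coarse_best = max(linspace(lo, hi, steps), key=score)

    # Stage 2 (assumption): a window half as wide as the original range,
    # centred on the coarse optimum and sampled with the same number of steps,
    # which halves the angular step size.
    half_width = (hi - lo) / 4
    fine = linspace(coarse_best - half_width, coarse_best + half_width, steps)
    return max(fine, key=score)


if __name__ == "__main__":
    # Toy score peaking at 3.3 degrees, standing in for a real skew criterion.
    angle = two_stage_angle_search(lambda a: -(a - 3.3) ** 2)
    print(angle)
```

The second stage doubles the number of score evaluations, which matches the trade-off named in the commit message (more computation for more precision).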


1359 commits