eynollah

mirror of https://github.com/qurator-spk/eynollah.git synced 2026-05-26 07:39:22 +02:00

Author	SHA1	Message	Date
kba	d2aae35446	.	2026-04-28 15:39:53 +02:00
kba	d705f855f1	.	2026-04-28 15:36:50 +02:00
kba	abdcb1a1f9	.	2026-04-28 15:33:57 +02:00
kba	69280187c5	.	2026-04-28 15:29:48 +02:00
kba	1ba82ede88	.	2026-04-28 15:25:36 +02:00
kba	be1296150c	.	2026-04-28 15:07:33 +02:00
kba	4899a8fa17	.	2026-04-28 14:59:01 +02:00
kba	29ef9f09dc	.	2026-04-28 14:53:13 +02:00
kba	511222704e	.	2026-04-28 14:51:23 +02:00
kba	5c6e075975	Merge branch 'ocrd-wrappers' of https://github.com/qurator-spk/eynollah into ocrd-wrappers	2026-04-28 14:31:24 +02:00
kba	1ae862cf52	.	2026-04-28 14:31:15 +02:00
kba	a9e12a63da	wp	2026-04-28 12:18:29 +02:00
kba	957dc66e7c	organize ocrd-eynollah-segment like ocrd-sbb-binarize	2026-04-27 18:50:54 +02:00
Robert Sachunsky	bb092364af	get_slopes_and_deskew_new_light2: estimate slopes here, too… extract slopes from minimal bounding rectangles of textlines, using heuristics on aspect ratios, lengths and angles	2026-04-24 15:27:29 +02:00
Robert Sachunsky	c478c03db4	avoid deskewed contour matching w/ -romb	2026-04-24 15:27:29 +02:00
Robert Sachunsky	998ee2ecee	get_textlines_of_a_textregion_sorted: simplify	2026-04-24 15:27:29 +02:00
Robert Sachunsky	be61875d6e	get_textlines_of_a_textregion_sorted: w-h instead of w/h test	2026-04-24 15:27:29 +02:00
Robert Sachunsky	9723dfeb73	writer: also annotate col-classifier result… both notations: - in `/PcGts/Page/@custom` (CSS-style) - in `/PcGts/Metadata/Comment` (qurator-style)	2026-04-24 15:27:29 +02:00
Robert Sachunsky	e3720d6623	writer: also annotate page-level deskewing result	2026-04-24 15:27:29 +02:00
Robert Sachunsky	2da718f76f	writer, do_work_of_slopes*: drop passing bboxes around (needed no more)	2026-04-24 15:27:29 +02:00
Robert Sachunsky	b792324c5b	do_work_of_slopes_new_curved (if angle >45°): simplify, improve… - use new `rotate_image_enlarge` instead of custom (insufficient) padding w/ `rotate_image` - get external contours instead of tree (without checking hierarchy afterwards) - use largest textline contours by area instead of longest polygon path - always use `separate_lines` (but without its incorrect angle/offset calculations) instead of `separate_lines_vertical_cont` - calculate coordinate transformation (shift, angle) for all cases (including >45°) - simplify	2026-04-24 15:27:29 +02:00
Robert Sachunsky	dbdb6d0d53	rotate: rm unused failed variants, add new `rotate_image_enlarge`… (correct version that enlarges canvas instead of clipping corners, using only OpenCV)	2026-04-24 15:27:29 +02:00
Robert Sachunsky	d257869d83	do_work_of_slopes_new_curved (if angle <45°): simplify, improve… - use relative images, cropped to parent bbox (faster) - no `scale` parameter (unused) - use largest textline contours by area instead of first - simplify	2026-04-24 15:27:29 +02:00
Robert Sachunsky	0dce1f24d2	do_work_of_slopes_new_curved: improve deskewing… - return early if textline mask is empty - intersect textline mask with parent mask (so neighbouring, truncated textlines will not interfere) - fix bug when resulting angle is small: rather, compare with page angle - if there is more than 1 line in the region, * use median instead of mean to estimate y_diff * if height dominates over width and x_diff over y_diff, then assume 90°: transpose image, deskew on that, then add 90° to result - otherwise instead of just using page angle, try to estimate single-line angle by approximating slope of linear x-y regression on mask image; again, if height dominates over width, then assume +90° and use transposed image - drop unused `scale` param	2026-04-24 15:27:29 +02:00
Robert Sachunsky	97d9b0ea50	small_textlines_to_parent_adherence2: simplify, improve… - when merging large line with small lines, don't use first new contour but largest - get external contours instead of tree (without checking hierarchy afterwards) - simplify	2026-04-24 15:27:29 +02:00
Robert Sachunsky	0735cb9d2b	filter_contours_without_textline_inside: also filter slopes	2026-04-24 15:27:29 +02:00
Robert Sachunsky	fa8340dbb4	-cl: also filter textregions without textlines here	2026-04-24 15:27:29 +02:00
Robert Sachunsky	4a6d3968f9	major `run_single` refactoring… - rename `get_regions()` → `get_early_layout()` - split up `run_boxes_no/full_layout()` into shared * `get_full_layout()` (for lapping mapping, table decoding and optional full model prediction) * `get_deskewed_masks()` (for de-rotation) * extraction of various region types (polygons and confidences) * `run_boxes_order()` (for column detection and box ordering) - rename `contours_tables` → `polygons_of_tables` This further reduces redundant code, avoids splitting up the same functionality across different places depending on mode etc.	2026-04-24 15:27:29 +02:00
Robert Sachunsky	dfb40f4a49	hsep fusion: avoid zero division if zero overlap	2026-04-24 15:27:29 +02:00
Robert Sachunsky	b63e073121	skip deskewing if no textlines	2026-04-24 15:27:29 +02:00
Robert Sachunsky	7b5aa2a1f6	more `run_single` refactoring… - `run_single`: re-use `return_contours_of_interested_region` for extraction and filtering of text region contours - `run_single`: isolate new function `match_deskewed_contours` - `run_single`: apply dilation afterwards - rename `contours_only_text_parent_d_ordered` → `polygons_of_textregions_d` - rename `contours_only_text_parent` → `polygons_of_textregions` - rename `contours_only_text_parent_h` → `polygons_of_textregions_h` - `do_work_of_slopes_new_curved` and `get_slopes_and_deskew_new_curved`: no need for `mask_texts_only` array arg - `filter_contours_inside_a_bigger_one`: no need for `image` as array arg, simplify - `split_textregion_main_vs_head`: simplify, re-order arguments and return tuple logically - if no main text regions are found, just convert marginals to main text and continue normally instead of stopping early w/ empty marginals (i.e. no textlines)	2026-04-24 15:27:29 +02:00
Robert Sachunsky	a2f43b8d69	simplify, add confidence for headings as well	2026-04-23 21:14:39 +02:00
Robert Sachunsky	264b00f8ab	predictor: cache models' input shape instead of output shape	2026-04-23 21:14:39 +02:00
Robert Sachunsky	829256df91	do_prediction*: remove autosized variants, simplify	2026-04-23 21:14:39 +02:00
Robert Sachunsky	de65a55a04	mbro: simplify, add drop-caps as well, reduce batch size… - do_order_of_regions_with_model: * add `polygons_of_drop_capitals`, order these indices as well (model was not trained for this, but it works) * explicit label identifiers instead of number literals * map marginals and images correctly * simplify (a lot) * reduce inference batch size to accommodate 8 GB VRAM GPUs - return_indexes_of_contours_located_inside_another_list_of_contours: simplify	2026-04-23 21:14:39 +02:00
Robert Sachunsky	0dfc9d911f	run_boxes_no_full_layout: also map to fl labels here… (because -mbro assumes the label set from -fl)	2026-04-20 18:20:58 +02:00
Robert Sachunsky	0015f2675b	with -slro, also extract and apply page (Border) mask	2026-04-20 18:20:58 +02:00
Robert Sachunsky	569b96d1a9	find_number_of_columns_in_document: pass correct label_seps… - in fl: 6 - non-fl: 3 (now fixed)	2026-04-20 18:20:58 +02:00
Robert Sachunsky	f28a9c9e0b	add confidence for all region types, prepare for textlines… - pass on probabilities from predicted class everywhere - rename `confidence_matrix` → `confidence_regions` / `regions_confidence` - rename `get_textregion_confidences()` → `get_region_confidences()` - add same for tables, textlines and regionsfl (full layout model) - aggregate per-region confidence lists for image, table, drop-capital, left marginal and right marginal regions - add in writer - simplify/re-indent some - try to replace more number literals with class label identifiers	2026-04-20 18:20:58 +02:00
Robert Sachunsky	1164b97917	extract_text_regions_new: fix heading thresholding… - re-introduce boosting `heading` thresholding broken when refactoring (light version and do_prediction) - also return confidence for full layout prediction	2026-04-20 18:20:58 +02:00
Robert Sachunsky	20dc5c3188	also cover drop-capital in (heuristic) reading order	2026-04-20 18:20:58 +02:00
Robert Sachunsky	92e94753c7	decoding of dropcaps in -fl: ensure consistency w/ early layout… 1. use connected component analysis to get unique segments in early prediction result 2. for each drop-capital segment in full prediction result, find matching early segment 3. when they have high overlap, assign drop-capital label to the entire early segment	2026-04-20 18:20:58 +02:00
Robert Sachunsky	29b42fdfaa	decoding of drop-capitals in full layout: also allow replacing img… - rename `putt_bb_of_drop_capitals_of_model_in_patches_in_layout` → `fill_bb_of_drop_capitals` - also allow image (besides text) label in early layout prediction result when checking if entire bbox can be filled (as opposed to just drop-capital | image | background mask) - simplify	2026-04-16 18:37:27 +02:00
Robert Sachunsky	6e0aed35f4	run_boxes_*: simplify, document class label mappings, start using identifier constants instead of literals for labels	2026-04-16 18:37:27 +02:00
Robert Sachunsky	f29e876a7c	return_boxes_of_images_by_order_of_reading_new: sep label differs w/o -fl… fix bug where in non-full mode, the wrong class label was assumed for separator regions (3 in non- vs 6 in full layout mode): - pass in separator mask instead of full segmentation map - rename for clarity: - `regions_without_separators` → `text_mask` (already binary) - `regions_with_separators` → `sep_mask` (now just binary)	2026-04-16 05:16:23 +02:00
Robert Sachunsky	f5f2435a38	run_marginals: drop unnecessarily passing textline_mask, mask_seps, mask_images	2026-04-16 05:13:06 +02:00
Robert Sachunsky	9309586712	split_textregion_main_vs_header → split_textregion_main_vs_head… (and simplify)	2026-04-16 05:07:22 +02:00
Robert Sachunsky	0f82b568ba	do_prediction_new_concept: aggregate confidence for all classes… (not just text; will still have to pass that on to the writer...)	2026-04-16 05:02:20 +02:00
Robert Sachunsky	5a27e46b22	keep seps over artificial boundaries to improve col separation… (thresholding and decoding with artificial boundary class can overwrite existing column separators, which in turn can contribute to missing column boundaries; this prioritises seps over boundaries, which does not impair separation of instances, as seps will separate text/image/etc instances just as well as artificial boundaries)	2026-04-16 04:56:38 +02:00
Robert Sachunsky	9d6ff65e1d	get_tables_from_model: utilise artificial bound thresholding… (to improve separation of neighbouring tables, esp. across columns; since model's threshold class is particularly weak, also use lower threshold here)	2026-04-16 04:49:07 +02:00

1477 commits