dinglehopper

mirror of https://github.com/qurator-spk/dinglehopper.git synced 2026-07-29 15:02:33 +02:00

Author	SHA1	Message	Date
Benjamin Rosemann	40f23b8482	Added comments	2021-06-14 12:29:34 +02:00
Benjamin Rosemann	cee7b6891b	Fix CI Build	2021-06-12 10:40:23 +02:00
Benjamin Rosemann	714b569195	Fixed some flake8 and mypy issues.	2021-06-11 16:09:19 +02:00
Benjamin Rosemann	a44a3d4bf2	Error handling	2021-06-11 15:33:13 +02:00
Benjamin Rosemann	06468a436e	Implemented new metrics behaviour	2021-06-11 15:08:45 +02:00
Benjamin Rosemann	9f5112f8f6	Remove support for ExtractedText for bag metrics.	2021-06-11 10:23:26 +02:00
Benjamin Rosemann	381fe7cb6b	Switch to result tuple instead of multiple return parameters	2021-06-11 10:21:23 +02:00
Benjamin Rosemann	974ca3e5c0	Split html and json report generation	2021-06-11 09:35:26 +02:00
Benjamin Rosemann	8cd624f795	Add BoC and BoW metric Also some refactoring for helper methods on normalization and word splitting.	2021-06-08 17:41:44 +02:00
Benjamin Rosemann	4ccae9432d	Move metrics into separate package	2021-05-27 16:37:34 +02:00
Gerber, Mike	249787686f	Merge branch 'master' of github.com:qurator-spk/dinglehopper	2021-05-20 09:42:15 +02:00
Gerber, Mike	2a6cc5823e	🐛 dinglehopper: Call initLogging before logging When using ocrd_utils' getLogger(), we need to call initLogging() before doing any logging. Fixes #55.	2021-05-20 09:39:09 +02:00
Konstantin Baierer	7fde00d911	ReadingOrder may also contain UnorderedGroupIndexed	2021-05-18 17:34:08 +02:00
Gerber, Mike	1778b36a9a	🚧 dinglehopper: Read PAGE UnorderedGroup in XML order	2021-04-15 21:09:45 +02:00
Benjamin Rosemann	a68fc269d9	Fix the extraction of text from Page with TableRegion Dinglehopper did not consider `OrderedGroupIndex` in the `ReadingOrder` element when extracting text regions. As a consequence a `TableRegion` was not considered for text extraction.	2020-11-27 11:18:11 +01:00
Konstantin Baierer	74e0ac18ed	ocrd cli: use core-provided zip_input_files method	2020-11-19 16:00:28 +01:00
Gerber, Mike	389e253c11	🐛 dinglehopper: Fix alto_extract_lines()'s type annotation	2020-11-12 19:32:38 +01:00
Gerber, Mike	fe3923a8af	🐛 dinglehopper: Fix alto_extract()'s type annotation	2020-11-12 19:19:05 +01:00
Gerber, Mike	132f91d500	✔️ dinglehopper: Add missing integration test markers	2020-11-12 19:10:23 +01:00
Benjamin Rosemann	ce752e1912	Remove .idea folder and modify .gitignore Sharing even parts of the .idea folder in worldwide setting is bound to generate more problems than solutions. Therefore it should be removed and consequently ignore in .gitignore. Also adds some Python specific stuff to the .gitignore file.	2020-11-11 11:36:17 +01:00
Benjamin Rosemann	5270737c1f	Skip test on windows because it is unix specific.	2020-11-11 11:36:17 +01:00
Gerber, Mike	32a4b95a99	🐛 dinglehopper: Normalize in plain_extract()	2020-11-10 18:51:14 +01:00
Gerber, Mike	14421c8e53	🎨 dinglehopper: Reformat using black	2020-11-10 12:29:55 +01:00
Gerber, Mike	31c63f9e4c	🎨 dinglehopper: s/LOG/log	2020-11-09 16:55:43 +01:00
Robert Sachunsky	a60c14351e	1 more update for core's getLogger context	2020-11-03 17:46:59 +01:00
Benjamin Rosemann	c02569b41e	Fix f-strings for Python 3.5	2020-10-29 12:33:54 +01:00
Benjamin Rosemann	7b27b2834e	More complex sorting for text extraction When extracting text from TextEquiv nodes we may encounter nodes without index or nodes that should get sorted via the conf attribute. Therefore we added a more complex algorithm to extract a TextEquiv and inform the user via log messages if we encounter structures that we can handle but may produce unexpected results.	2020-10-29 10:03:40 +01:00
Benjamin Rosemann	6ff831dfd2	Sort textlines with missing indices Python's `sorted` method will fail with a TypeError when called with `None` and Integers: ```python >>> sorted([None, 1]) TypeError: '<' not supported between instances of 'int' and 'NoneType' ``` Therefore we are using `float('inf')` instead of `None` in case of missing textline indices.	2020-10-29 10:03:40 +01:00
Gerber, Mike	5cbe148741	🐛 dinglehopper: Skip pages if there is no GT nor OCR (Fixes GH-34)	2020-10-21 19:29:45 +02:00
Gerber, Mike	e4e2777cb7	🐛 dinglehopper: Do try to get text when no TextEquivs exist	2020-10-21 17:59:44 +02:00
Gerber, Mike	1c88891a98	✔️ Add test data for LAREX's indexed TextEquivs (unused)	2020-10-21 17:51:15 +02:00
Gerber, Mike	19d15e3ecc	🐛 dinglehopper: Honor TextEquiv index (Closes GH-33)	2020-10-21 17:50:21 +02:00
Gerber, Mike	f626a2ebe6	🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder	2020-10-21 17:03:55 +02:00
Gerber, Mike	8b4ee20a40	✨ Add a new CLI tool dinglehopper-extract to just give the extracted text	2020-10-21 16:30:48 +02:00
Gerber, Mike	b23b75b601	✨ dinglehopper: Give segment ids from the extracted textequiv_level	2020-10-21 16:04:33 +02:00
Gerber, Mike	b23e4ce30e	✨ dinglehopper: Add OCR-D parameter to choose TextEquiv level	2020-10-21 14:38:19 +02:00
Gerber, Mike	9744fa2567	✨ dinglehopper: Add CLI option to choose TextEquiv level	2020-10-20 19:33:39 +02:00
Gerber, Mike	75733039b8	🧹 dinglehopper: Do not hardcode joiner to \n	2020-10-20 18:43:56 +02:00
Gerber, Mike	3848412349	✨ dinglehopper: Implement the basic text extraction from PAGE TextLines	2020-10-20 18:40:21 +02:00
Gerber, Mike	f2367ac0c3	🐛 Fix OCR-D CLI for newest OCR-D Now that find_files() is a generator, we can't use [0] to get the file.	2020-10-16 14:58:27 +02:00
Gerber, Mike	5ed184c8c4	✨ dinglehopper: Show a progressbar on --progress	2020-10-15 16:09:54 +02:00
Gerber, Mike	4951823a29	🧹 dinglehopper: Disable metrics in JSON report, too	2020-10-15 15:38:15 +02:00
Gerber, Mike	82217a25bb	🧹 dinglehopper: Move all normalization code to extracted_text.py	2020-10-08 17:29:25 +02:00
Gerber, Mike	c6c6b8efab	📝 dinglehopper: Add detail about the text extraction and ExtractedText	2020-10-08 17:05:36 +02:00
Gerber, Mike	f50591abac	Merge branch 'feat/display-segment-id'	2020-10-08 13:39:38 +02:00
Gerber, Mike	c514abfb9f	🧹 dinglehopper: Sanitize imports	2020-10-08 13:33:19 +02:00
Gerber, Mike	1077dc64ce	➡️ dinglehopper: Move ExtractedText to its own file	2020-10-08 13:25:20 +02:00
Gerber, Mike	9dd4ff0aae	✨ dinglehopper: Extract line IDs for ALTO	2020-10-08 12:54:28 +02:00
Gerber, Mike	f3aafb6fdf	✨ dinglehopper: Validate ExtractedText.{segments,_text} in both directions	2020-10-08 12:20:27 +02:00
Gerber, Mike	b14c35e147	🎨 dinglehopper: Use multimethod to handle str vs ExtractedText	2020-10-08 12:15:58 +02:00

118 commits
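
Commit 6ff831dfd2 above describes replacing `None` with `float('inf')` so that textlines with missing indices can still be sorted. A minimal sketch of that idea follows; the `sort_key` helper and the sample `(index, text)` pairs are hypothetical and not taken from the dinglehopper code.

```python
def sort_key(index):
    """Map a missing index (None) to +inf so it sorts after every real index."""
    return float("inf") if index is None else index


# Hypothetical (index, text) pairs; None marks a missing textline index.
lines = [(2, "second"), (None, "unindexed"), (0, "first")]

# Sorting on the raw indices would raise a TypeError because None and int
# cannot be compared; the key function avoids that comparison entirely.
ordered = sorted(lines, key=lambda pair: sort_key(pair[0]))
texts = [text for _, text in ordered]
```

With this key, indexed lines keep their numeric order and unindexed lines end up last; because `sorted` is stable, several unindexed lines would keep their original relative order.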