dinglehopper

mirror of https://github.com/qurator-spk/dinglehopper.git synced 2026-07-29 15:02:33 +02:00

Author	SHA1	Message	Date
Benjamin Rosemann	0dd5fc0ee5	Small corrections	2021-02-16 11:28:24 +01:00
Benjamin Rosemann	b24d8d5664	Performance increases. Temporarily switch to the C implementation of python-levenshtein for editops calculation. Also added some variables, caching and type changes for performance gains.	2021-02-16 11:28:24 +01:00
Benjamin Rosemann	0ef7810dd0	Reduce number of splits for short (one char) elements	2021-02-16 11:28:24 +01:00
Benjamin Rosemann	c9219cbacd	Make sure that 0 cer and wer are reported	2021-02-16 11:28:23 +01:00
Benjamin Rosemann	fd6f57a263	Fix broken build on Python 3.5	2021-02-16 11:28:23 +01:00
Benjamin Rosemann	cac437afbf	Evaluate some performance issues	2021-02-16 11:28:23 +01:00
Benjamin Rosemann	1bc7ef6c8b	Correct report for fca. As the fca implementation already knows the editing operations for each segment, we use a different sequence alignment method.	2021-02-16 11:28:23 +01:00
Benjamin Rosemann	750ad00d1b	Add tooltips to fca report	2021-02-16 11:28:23 +01:00
Benjamin Rosemann	53064bf833	Include fca as parameter and add some tests	2021-02-16 11:28:23 +01:00
Benjamin Rosemann	4a87adc2c7	Implement version-specific data structures. As ocr-d continues to support Python 3.5 until the end of this year, version-specific data structures have been implemented. When support for Python 3.5 is dropped, the extra file can easily be removed.	2021-02-16 11:28:23 +01:00
Benjamin Rosemann	2a215a1062	Reformat using black	2021-02-16 11:28:23 +01:00
Benjamin Rosemann	5277593bdb	Fix some special cases	2021-02-16 11:28:23 +01:00
Benjamin Rosemann	d7a74fa58b	First draft of flexible character accuracy	2021-02-16 11:28:23 +01:00
Benjamin Rosemann	a68fc269d9	Fix the extraction of text from Page with TableRegion. Dinglehopper did not consider `OrderedGroupIndex` in the `ReadingOrder` element when extracting text regions. As a consequence, a `TableRegion` was not considered for text extraction.	2020-11-27 11:18:11 +01:00
Konstantin Baierer	74e0ac18ed	ocrd cli: use core-provided zip_input_files method	2020-11-19 16:00:28 +01:00
Gerber, Mike	389e253c11	🐛 dinglehopper: Fix alto_extract_lines()'s type annotation	2020-11-12 19:32:38 +01:00
Gerber, Mike	fe3923a8af	🐛 dinglehopper: Fix alto_extract()'s type annotation	2020-11-12 19:19:05 +01:00
Gerber, Mike	132f91d500	✔️ dinglehopper: Add missing integration test markers	2020-11-12 19:10:23 +01:00
Benjamin Rosemann	ce752e1912	Remove .idea folder and modify .gitignore. Sharing even parts of the .idea folder in a worldwide setting is bound to generate more problems than solutions. Therefore it should be removed and consequently ignored in .gitignore. Also adds some Python-specific stuff to the .gitignore file.	2020-11-11 11:36:17 +01:00
Benjamin Rosemann	5270737c1f	Skip test on windows because it is unix specific.	2020-11-11 11:36:17 +01:00
Gerber, Mike	32a4b95a99	🐛 dinglehopper: Normalize in plain_extract()	2020-11-10 18:51:14 +01:00
Gerber, Mike	14421c8e53	🎨 dinglehopper: Reformat using black	2020-11-10 12:29:55 +01:00
Gerber, Mike	31c63f9e4c	🎨 dinglehopper: s/LOG/log	2020-11-09 16:55:43 +01:00
Robert Sachunsky	a60c14351e	1 more update for core's getLogger context	2020-11-03 17:46:59 +01:00
Benjamin Rosemann	c02569b41e	Fix f-strings for Python 3.5	2020-10-29 12:33:54 +01:00
Benjamin Rosemann	7b27b2834e	More complex sorting for text extraction. When extracting text from TextEquiv nodes, we may encounter nodes without an index or nodes that should be sorted via the conf attribute. Therefore we added a more complex algorithm to extract a TextEquiv and inform the user via log messages if we encounter structures that we can handle but that may produce unexpected results.	2020-10-29 10:03:40 +01:00
Benjamin Rosemann	6ff831dfd2	Sort textlines with missing indices. Python's `sorted` function fails with a TypeError when called on a list that mixes `None` and integers: `sorted([None, 1])` raises `TypeError: '<' not supported between instances of 'int' and 'NoneType'`. Therefore we are using `float('inf')` instead of `None` in case of missing textline indices.	2020-10-29 10:03:40 +01:00
Gerber, Mike	5cbe148741	🐛 dinglehopper: Skip pages if there is no GT nor OCR (Fixes GH-34)	2020-10-21 19:29:45 +02:00
Gerber, Mike	e4e2777cb7	🐛 dinglehopper: Do try to get text when no TextEquivs exist	2020-10-21 17:59:44 +02:00
Gerber, Mike	1c88891a98	✔️ Add test data for LAREX's indexed TextEquivs (unused)	2020-10-21 17:51:15 +02:00
Gerber, Mike	19d15e3ecc	🐛 dinglehopper: Honor TextEquiv index (Closes GH-33)	2020-10-21 17:50:21 +02:00
Gerber, Mike	f626a2ebe6	🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder	2020-10-21 17:03:55 +02:00
Gerber, Mike	8b4ee20a40	✨ Add a new CLI tool dinglehopper-extract to just give the extracted text	2020-10-21 16:30:48 +02:00
Gerber, Mike	b23b75b601	✨ dinglehopper: Give segment ids from the extracted textequiv_level	2020-10-21 16:04:33 +02:00
Gerber, Mike	b23e4ce30e	✨ dinglehopper: Add OCR-D parameter to choose TextEquiv level	2020-10-21 14:38:19 +02:00
Gerber, Mike	9744fa2567	✨ dinglehopper: Add CLI option to choose TextEquiv level	2020-10-20 19:33:39 +02:00
Gerber, Mike	75733039b8	🧹 dinglehopper: Do not hardcode joiner to \n	2020-10-20 18:43:56 +02:00
Gerber, Mike	3848412349	✨ dinglehopper: Implement the basic text extraction from PAGE TextLines	2020-10-20 18:40:21 +02:00
Gerber, Mike	f2367ac0c3	🐛 Fix OCR-D CLI for newest OCR-D. Now that find_files() is a generator, we can't use [0] to get the file.	2020-10-16 14:58:27 +02:00
Gerber, Mike	5ed184c8c4	✨ dinglehopper: Show a progressbar on --progress	2020-10-15 16:09:54 +02:00
Gerber, Mike	4951823a29	🧹 dinglehopper: Disable metrics in JSON report, too	2020-10-15 15:38:15 +02:00
Gerber, Mike	82217a25bb	🧹 dinglehopper: Move all normalization code to extracted_text.py	2020-10-08 17:29:25 +02:00
Gerber, Mike	c6c6b8efab	📝 dinglehopper: Add detail about the text extraction and ExtractedText	2020-10-08 17:05:36 +02:00
Gerber, Mike	f50591abac	Merge branch 'feat/display-segment-id'	2020-10-08 13:39:38 +02:00
Gerber, Mike	c514abfb9f	🧹 dinglehopper: Sanitize imports	2020-10-08 13:33:19 +02:00
Gerber, Mike	1077dc64ce	➡️ dinglehopper: Move ExtractedText to its own file	2020-10-08 13:25:20 +02:00
Gerber, Mike	9dd4ff0aae	✨ dinglehopper: Extract line IDs for ALTO	2020-10-08 12:54:28 +02:00
Gerber, Mike	f3aafb6fdf	✨ dinglehopper: Validate ExtractedText.{segments,_text} in both directions	2020-10-08 12:20:27 +02:00
Gerber, Mike	b14c35e147	🎨 dinglehopper: Use multimethod to handle str vs ExtractedText	2020-10-08 12:15:58 +02:00
Gerber, Mike	a17ee2afec	🚧 dinglehopper: Guarantee NFC + rename from_text → from_str	2020-10-08 11:25:01 +02:00
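
Commit 6ff831dfd2 describes replacing a missing textline index with `float('inf')` so that sorting no longer fails. The following is a minimal sketch of that idea, not the code from the repository: `sort_key` and the `(index, text)` pairs are hypothetical names and data chosen for illustration.

```python
def sort_key(index):
    """Map a missing index (None) to +infinity so it compares with ints.

    Hypothetical helper, not taken from dinglehopper.
    """
    return float("inf") if index is None else index


# Hypothetical (index, text) pairs; one line has no index.
lines = [(2, "second"), (None, "unindexed"), (1, "first")]

# Sorting on the raw index would raise TypeError because None and int
# cannot be compared; the key function avoids that.
ordered = sorted(lines, key=lambda item: sort_key(item[0]))
texts = [text for _, text in ordered]
```

With this key, entries without an index sort after all indexed entries, and because `sorted` is stable they keep their relative input order.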


117 commits
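
Commit f2367ac0c3 notes that `[0]` no longer works once `find_files()` is a generator. The sketch below illustrates the general Python behavior only; the `find_files` defined here is a hypothetical stand-in that yields file names, not the OCR-D function, and `first_or_none` is an assumed helper, not necessarily how the commit resolved the issue.

```python
def find_files():
    """Hypothetical stand-in for a finder that yields results lazily."""
    yield "file_a.xml"
    yield "file_b.xml"


def first_or_none(iterable):
    """Return the first item of any iterable, or None if it is empty."""
    return next(iter(iterable), None)


# Generators do not support subscripting, so [0] raises TypeError.
try:
    find_files()[0]
    subscript_ok = True
except TypeError:
    subscript_ok = False

# Taking the first item with next() works for generators and lists alike.
first = first_or_none(find_files())
```

Wrapping the call in `list(...)` before indexing is another option, at the cost of consuming the whole generator.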