dinglehopper

mirror of https://github.com/qurator-spk/dinglehopper.git synced 2025-07-02 15:09:59 +02:00

Author	SHA1	Message	Date
Benjamin Rosemann	082e30822f	Fix method return type	2021-02-16 11:26:02 +01:00
Benjamin Rosemann	e371da899e	Switch from custom Levenshtein to python-Levenshtein As the distance and editops calculation is a performance bottleneck in this application we substituted the custom Levenshtein implementation to the C implementation in the python-Levenshtein package. We now also have separate entrypoints for texts with unicode normalization and without because this also can be done more efficiently once upon preprocessing.	2021-02-16 11:26:02 +01:00
Benjamin Rosemann	0e263cfac2	Switch between c and own implementation for distance and editops.	2021-02-16 11:26:02 +01:00
Benjamin Rosemann	11916c2dcf	Refactor tests in preparation of refactoring levenshtein.	2021-02-16 11:26:02 +01:00
Gerber, Mike	bd324331e6	🚧 dinglehopper: Try out Drone CI	2021-02-11 14:26:29 +01:00
Gerber, Mike	a59ecb795c	🚧 dinglehopper: Try out Drone CI	2021-02-11 14:15:08 +01:00
Gerber, Mike	14230e073a	🚧 dinglehopper: Try out Drone CI	2021-02-11 14:08:25 +01:00
Gerber, Mike	985666a71c	🚧 dinglehopper: Try out Drone CI	2021-02-10 20:35:22 +01:00
Gerber, Mike	4a73053cfc	🚧 Replace Travis with CircleCI	2021-02-10 18:22:52 +01:00
Gerber, Mike	e3d4493c82	🚧 Replace Travis with CircleCI	2021-02-10 17:58:58 +01:00
Gerber, Mike	27f4c3bdf8	🚧 Replace Travis with CircleCI	2021-02-10 17:57:08 +01:00
Gerber, Mike	8533e6d421	🚧 Replace Travis with CircleCI	2021-02-10 17:55:09 +01:00
Gerber, Mike	e8da8b63f8	🚧 Replace Travis with CircleCI	2021-02-10 17:53:50 +01:00
Gerber, Mike	3b7a1a5631	🚧 Replace Travis with CircleCI	2021-02-10 17:50:34 +01:00
Mike Gerber	691ce371ca	Merge pull request #50 from b2m/fix-table-extraction Fix the extraction of text from Page with TableRegion	2021-02-01 17:51:33 +01:00
Benjamin Rosemann	a68fc269d9	Fix the extraction of text from Page with TableRegion Dinglehopper did not consider `OrderedGroupIndex` in the `ReadingOrder` element when extracting text regions. As a consequence a `TableRegion` was not considered for text extraction.	2020-11-27 11:18:11 +01:00
Gerber, Mike	8cd8314c8a	🐛 dinglehopper: Bump up ocrd req for zip_input_files See also GH-49.	2020-11-19 18:59:47 +01:00
Mike Gerber	62670dd0c7	Merge pull request #49 from kba/zip_input_files ocrd cli: use core-provided zip_input_files method	2020-11-19 18:54:21 +01:00
Konstantin Baierer	74e0ac18ed	ocrd cli: use core-provided zip_input_files method	2020-11-19 16:00:28 +01:00
Gerber, Mike	389e253c11	🐛 dinglehopper: Fix alto_extract_lines()'s type annotation	2020-11-12 19:32:38 +01:00
Gerber, Mike	fe3923a8af	🐛 dinglehopper: Fix alto_extract()'s type annotation	2020-11-12 19:19:05 +01:00
Gerber, Mike	132f91d500	✔️ dinglehopper: Add missing integration test markers	2020-11-12 19:10:23 +01:00
Gerber, Mike	c48d7646df	📝 dinglehopper: README-DEV: Massage markdown a bit	2020-11-12 19:05:14 +01:00
Mike Gerber	fed021090d	Merge pull request #46 from b2m/tool-changes Tool changes	2020-11-12 18:59:25 +01:00
Benjamin Rosemann	cb1ac9d260	Add black to developer requirements.	2020-11-11 11:36:17 +01:00
Benjamin Rosemann	03ad413f4a	Added some helpful tools and configurations	2020-11-11 11:36:17 +01:00
Benjamin Rosemann	5cbd4f3d95	Preparation for black code formatter	2020-11-11 11:36:17 +01:00
Benjamin Rosemann	ce752e1912	Remove .idea folder and modify .gitignore Sharing even parts of the .idea folder in worldwide setting is bound to generate more problems than solutions. Therefore it should be removed and consequently ignore in .gitignore. Also adds some Python specific stuff to the .gitignore file.	2020-11-11 11:36:17 +01:00
Benjamin Rosemann	5270737c1f	Skip test on windows because it is unix specific.	2020-11-11 11:36:17 +01:00
Gerber, Mike	32a4b95a99	🐛 dinglehopper: Normalize in plain_extract()	2020-11-10 18:51:14 +01:00
Gerber, Mike	14421c8e53	🎨 dinglehopper: Reformat using black	2020-11-10 12:29:55 +01:00
Gerber, Mike	31c63f9e4c	🎨 dinglehopper: s/LOG/log	2020-11-09 16:55:43 +01:00
Mike Gerber	0804b029c4	Merge pull request #43 from bertsky/patch-1 1 more update for core's getLogger context	2020-11-09 16:51:00 +01:00
Robert Sachunsky	a60c14351e	1 more update for core's getLogger context	2020-11-03 17:46:59 +01:00
Mike Gerber	a51f0b3dcd	Merge pull request #42 from b2m/test-python-cache-for-travis Add travis pip caching	2020-10-30 12:35:20 +01:00
Benjamin Rosemann	b10af9f138	Test travis pip caching	2020-10-29 16:41:19 +01:00
Mike Gerber	089f6d299e	Merge pull request #37 from b2m/fix-sort-with-none Sort textlines with missing indices	2020-10-29 15:05:46 +01:00
Mike Gerber	5138a1de21	Merge pull request #39 from b2m/test-python-3.9 Add Python 3.9 to .travis.yml	2020-10-29 13:42:24 +01:00
Benjamin Rosemann	c02569b41e	Fix f-strings for Python 3.5	2020-10-29 12:33:54 +01:00
Benjamin Rosemann	7b27b2834e	More complex sorting for text extraction When extracting text from TextEquiv nodes we may encounter nodes without index or nodes that should get sorted via the conf attribute. Therefore we added a more complex algorithm to extract a TextEquiv and inform the user via log messages if we encounter structures that we can handle but may produce unexpected results.	2020-10-29 10:03:40 +01:00
Benjamin Rosemann	6ff831dfd2	Sort textlines with missing indices Python's `sorted` method will fail with a TypeError when called with `None` and Integers: ```python >>> sorted([None, 1]) TypeError: '<' not supported between instances of 'int' and 'NoneType' ``` Therefore we are using `float('inf')` instead of `None` in case of missing textline indices.	2020-10-29 10:03:40 +01:00
Benjamin Rosemann	e77f19fefc	Add Python 3.9 to .travis.yml	2020-10-29 10:02:51 +01:00
Mike Gerber	082fc9e09a	Merge pull request #38 from b2m/add-editorconfig Add .editorconfig	2020-10-28 15:16:04 +01:00
Benjamin Rosemann	20661487d6	Add .editorconfig Add a proposal for a .editorconfig file (see https://editorconfig.org/). This is natively supported by a lot of editors, others are supported via plugins. This will close #19.	2020-10-28 11:31:18 +01:00
Gerber, Mike	6e47acda1c	📝 dinglehopper: Move screenshot higher	2020-10-21 19:31:53 +02:00
Gerber, Mike	5cbe148741	🐛 dinglehopper: Skip pages if there is no GT nor OCR (Fixes GH-34)	2020-10-21 19:29:45 +02:00
Gerber, Mike	e4e2777cb7	🐛 dinglehopper: Do try to get text when no TextEquivs exist	2020-10-21 17:59:44 +02:00
Gerber, Mike	f14ae46870	Merge branch 'feat/text-extraction-levels'	2020-10-21 17:51:44 +02:00
Gerber, Mike	1c88891a98	✔️ Add test data for LAREX's indexed TextEquivs (unused)	2020-10-21 17:51:15 +02:00
Gerber, Mike	19d15e3ecc	🐛 dinglehopper: Honor TextEquiv index (Closes GH-33)	2020-10-21 17:50:21 +02:00


240 commits
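
Commit 6ff831dfd2 describes replacing `None` with `float('inf')` so that text lines without an index can still be sorted. The following is a minimal sketch of that idea, not the project's actual code: the dictionary layout and the `index_key` helper are hypothetical and only illustrate the sentinel technique.

```python
def index_key(line):
    """Sort key for a text line (hypothetical dict layout).

    A missing index (None) is mapped to float('inf') so that it compares
    cleanly against integers and sorts after every indexed line.
    """
    index = line.get("index")
    return float("inf") if index is None else index


# Illustrative data only; these are not values from the repository.
lines = [
    {"text": "second", "index": 1},
    {"text": "no index", "index": None},
    {"text": "first", "index": 0},
]

# sorted() would raise TypeError if it had to compare None with an int;
# with the key function every comparison is between numbers.
ordered = sorted(lines, key=index_key)
print([line["text"] for line in ordered])
```

Because `sorted` is stable, several lines without an index keep their original relative order at the end of the result.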