mirror of https://github.com/qurator-spk/dinglehopper.git synced 2025-06-09 03:40:12 +02:00
Commit graph

93 commits

Author SHA1 Message Date
Benjamin Rosemann
c02569b41e Fix f-strings for Python 3.5 2020-10-29 12:33:54 +01:00
Benjamin Rosemann
7b27b2834e More complex sorting for text extraction
When extracting text from TextEquiv nodes we may encounter nodes without
index or nodes that should get sorted via the conf attribute.

Therefore we added a more complex algorithm to extract a TextEquiv and
inform the user via log messages if we encounter structures that we can
handle but may produce unexpected results.
2020-10-29 10:03:40 +01:00
Benjamin Rosemann
6ff831dfd2 Sort textlines with missing indices
Python's built-in `sorted` function fails with a TypeError when the list
mixes `None` and integers:

```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we are using `float('inf')` instead of `None` in case of
missing textline indices.
2020-10-29 10:03:40 +01:00
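The idea described in the 6ff831dfd2 message above can be sketched as follows. This is a minimal illustration, not the code from the commit: the `sort_key` helper and the `(index, text)` tuples are hypothetical stand-ins for textlines with an optional index.

```python
def sort_key(index):
    """Map a missing index (None) to +infinity so it compares with ints."""
    # Hypothetical helper: float('inf') is larger than any integer index,
    # so entries without an index end up at the end of the sorted result.
    return float("inf") if index is None else index


# Hypothetical (index, text) pairs standing in for textlines.
lines = [(2, "second"), (None, "unindexed"), (1, "first")]

ordered = sorted(lines, key=lambda pair: sort_key(pair[0]))
print([text for _, text in ordered])  # ['first', 'second', 'unindexed']
```

Because `sorted` is stable, several entries without an index keep their original relative order after the indexed ones.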
5cbe148741 🐛 dinglehopper: Skip pages if there is no GT nor OCR (Fixes GH-34) 2020-10-21 19:29:45 +02:00
e4e2777cb7 🐛 dinglehopper: Do try to get text when no TextEquivs exist 2020-10-21 17:59:44 +02:00
1c88891a98 ✔️ Add test data for LAREX's indexed TextEquivs (unused) 2020-10-21 17:51:15 +02:00
19d15e3ecc 🐛 dinglehopper: Honor TextEquiv index (Closes GH-33) 2020-10-21 17:50:21 +02:00
f626a2ebe6 🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder 2020-10-21 17:03:55 +02:00
8b4ee20a40 Add a new CLI tool dinglehopper-extract to just give the extracted text 2020-10-21 16:30:48 +02:00
b23b75b601 dinglehopper: Give segment ids from the extracted textequiv_level 2020-10-21 16:04:33 +02:00
b23e4ce30e dinglehopper: Add OCR-D parameter to choose TextEquiv level 2020-10-21 14:38:19 +02:00
9744fa2567 dinglehopper: Add CLI option to choose TextEquiv level 2020-10-20 19:33:39 +02:00
75733039b8 🧹 dinglehopper: Do not hardcode joiner to \n 2020-10-20 18:43:56 +02:00
3848412349 dinglehopper: Implement the basic text extraction from PAGE TextLines 2020-10-20 18:40:21 +02:00
f2367ac0c3 🐛 Fix OCR-D CLI for newest OCR-D
Now that find_files() is a generator, we can't use [0] to get the file.
2020-10-16 14:58:27 +02:00
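The problem named in the f2367ac0c3 message above is general Python behaviour: a generator cannot be indexed. A minimal sketch, assuming a hypothetical `find_files_stub()` in place of the real OCR-D call (whose signature is not shown here):

```python
def find_files_stub():
    """Hypothetical stand-in for an API that yields its results lazily."""
    yield "file_0001.xml"
    yield "file_0002.xml"


# Indexing a generator raises TypeError, which is the failure described above.
try:
    find_files_stub()[0]
except TypeError as err:
    print("indexing failed:", err)

# One way to take the first result: next() with a default for the empty case.
first = next(find_files_stub(), None)
print(first)  # file_0001.xml
```

Whether the commit uses `next()` or another approach is not stated in the log. Wrapping the call in `list(...)` would also work, at the cost of consuming the whole generator.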
5ed184c8c4 dinglehopper: Show a progressbar on --progress 2020-10-15 16:09:54 +02:00
4951823a29 🧹 dinglehopper: Disable metrics in JSON report, too 2020-10-15 15:38:15 +02:00
82217a25bb 🧹 dinglehopper: Move all normalization code to extracted_text.py 2020-10-08 17:29:25 +02:00
c6c6b8efab 📝 dinglehopper: Add detail about the text extraction and ExtractedText 2020-10-08 17:05:36 +02:00
f50591abac Merge branch 'feat/display-segment-id' 2020-10-08 13:39:38 +02:00
c514abfb9f 🧹 dinglehopper: Sanitize imports 2020-10-08 13:33:19 +02:00
1077dc64ce ➡️ dinglehopper: Move ExtractedText to its own file 2020-10-08 13:25:20 +02:00
9dd4ff0aae dinglehopper: Extract line IDs for ALTO 2020-10-08 12:54:28 +02:00
f3aafb6fdf dinglehopper: Validate ExtractedText.{segments,_text} in both directions 2020-10-08 12:20:27 +02:00
b14c35e147 🎨 dinglehopper: Use multimethod to handle str vs ExtractedText 2020-10-08 12:15:58 +02:00
a17ee2afec 🚧 dinglehopper: Guarantee NFC + rename from_text → from_str 2020-10-08 11:25:01 +02:00
7843824eaf 🚧 dinglehopper: Support str & ExtractedText in CER and distance functions 2020-10-08 10:47:20 +02:00
5bee55c896 💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments 2020-10-07 18:40:06 +02:00
96b55f1806 🚧 dinglehopper: Hierarchical text representation 2020-10-07 18:31:52 +02:00
d706ef4621 📝 Document CER/WER and the format detection (Fixes GH-26) 2020-09-30 17:58:05 +02:00
da47e41c85 💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments 2020-09-25 14:53:19 +02:00
7085ee0fd8
Merge pull request #29 from kba/getlogger
getLogger per method
2020-09-25 13:20:58 +02:00
77154ef256 📝 dinglehopper: Document REPORT_PREFIX (Closes GH-27.) 2020-09-24 20:58:15 +02:00
Konstantin Baierer
12da98e477 getLogger per method 2020-09-24 10:16:52 +02:00
Konstantin Baierer
004ae298ca ocrd cli: use make_file_id and assert_file_grp_cardinality 2020-08-07 18:00:33 +02:00
6ab38f1bda 🎨 dinglehopper: Make PyCharm happier with the type hinting, newlines etc. 2020-06-18 13:27:59 +02:00
d484810038 dinglehopper: Validate read segment ids 2020-06-18 13:27:59 +02:00
d39f74f11a 🧹 dinglehopper: Remove obsolete normalization-related FIXME 2020-06-18 13:27:59 +02:00
8c5f7c73d5 🧹 dinglehopper: Replace XXX with an actual comment 2020-06-18 13:27:59 +02:00
37edc0336f 🧹 dinglehopper: Remove obsolete XXX that has a GitHub issue 2020-06-18 13:27:59 +02:00
9f05e6ca4c 🧹 dinglehopper: Remove obsolete XXX about None ids 2020-06-18 13:27:59 +02:00
4469af62c8 🎨 dinglehopper: Unfuck substitutions a bit 2020-06-18 13:27:59 +02:00
079be203bd 🐛 dinglehopper: Fix tests to deal with new normalization logic 2020-06-18 13:27:59 +02:00
c010a7f05e 🧹 dinglehopper: Calculate segment ids once, on the first call 2020-06-18 13:27:59 +02:00
0cf7ff4721 🧹 dinglehopper: Remove obsolete XXX about the PAGE hierarchy 2020-06-18 13:27:59 +02:00
c432cb505a 🧹 dinglehopper: Clean up test_lines_similar() 2020-06-18 13:27:59 +02:00
0c33e84415 📓 dinglehopper: Document editops() 2020-06-18 13:27:59 +02:00
a61c935624 🧹 dinglehopper: Move Python 3.5 XXXs to a GitHub issue
See https://github.com/qurator-spk/dinglehopper/issues/20.
2020-06-18 13:27:59 +02:00
257e4986cc 🚧 dinglehopper: Use a Bootstrap tooltip for the segment id 2020-06-18 13:27:59 +02:00
a320d5fd8f 🚧 dinglehopper: Re-introduce "substitute_equivalences" as Normalization.NFC_SBB 2020-06-18 13:27:59 +02:00