Benjamin Rosemann
c02569b41e
Fix f-strings for Python 3.5
2020-10-29 12:33:54 +01:00
Benjamin Rosemann
7b27b2834e
More complex sorting for text extraction
When extracting text from TextEquiv nodes, we may encounter nodes without an
index or nodes that should be sorted via the conf attribute.
Therefore we added a more complex algorithm to extract a TextEquiv. It
informs the user via log messages if we encounter structures that we can
handle but that may produce unexpected results.
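The selection described above might look roughly like the following sketch. It is not the code from this commit: the dict layout, the function name `pick_text_equiv`, and the exact precedence (lowest index first, highest conf as fallback) are assumptions made for illustration.

```python
import logging

log = logging.getLogger(__name__)


def pick_text_equiv(candidates):
    """Pick one candidate from a list of dicts (hypothetical layout).

    Each dict may carry an 'index' and a 'conf' key; both may be missing or None.
    """
    if not candidates:
        return None

    # Assumption: an explicit index takes precedence, lowest index wins.
    indexed = [c for c in candidates if c.get("index") is not None]
    if indexed:
        if len(indexed) < len(candidates):
            log.warning("Some TextEquiv candidates have no index; ignoring them.")
        return min(indexed, key=lambda c: c["index"])

    # Assumption: without any index, fall back to the highest confidence.
    if len(candidates) > 1:
        log.warning("No TextEquiv index found; falling back to highest conf.")
    return max(candidates, key=lambda c: c.get("conf") or 0.0)
```

The warnings stand in for the log messages mentioned above: the input can be handled, but the result may not be what the user expects.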
2020-10-29 10:03:40 +01:00
Benjamin Rosemann
6ff831dfd2
Sort textlines with missing indices
Python's `sorted` function fails with a TypeError when called with a list
that mixes `None` and integers:
```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we are using `float('inf')` instead of `None` in case of
missing textline indices.
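A minimal sketch of this workaround, assuming a hypothetical list of `(index, text)` pairs instead of the actual textline objects:

```python
# Hypothetical data: (index, text) pairs; None marks a missing index.
lines = [(None, "c"), (1, "b"), (0, "a")]


def sort_key(line):
    index = line[0]
    # float('inf') compares cleanly with integers,
    # so lines without an index sort last.
    return float('inf') if index is None else index


ordered = sorted(lines, key=sort_key)
texts = [text for _, text in ordered]
```

Because `sorted` is stable, several lines without an index keep their original relative order at the end.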
2020-10-29 10:03:40 +01:00
5cbe148741
🐛 dinglehopper: Skip pages if there is neither GT nor OCR (Fixes GH-34)
2020-10-21 19:29:45 +02:00
e4e2777cb7
🐛 dinglehopper: Do try to get text when no TextEquivs exist
2020-10-21 17:59:44 +02:00
1c88891a98
✔️ Add test data for LAREX's indexed TextEquivs (unused)
2020-10-21 17:51:15 +02:00
19d15e3ecc
🐛 dinglehopper: Honor TextEquiv index (Closes GH-33)
2020-10-21 17:50:21 +02:00
f626a2ebe6
🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder
2020-10-21 17:03:55 +02:00
8b4ee20a40
✨ Add a new CLI tool dinglehopper-extract to just give the extracted text
2020-10-21 16:30:48 +02:00
b23b75b601
✨ dinglehopper: Give segment ids from the extracted textequiv_level
2020-10-21 16:04:33 +02:00
b23e4ce30e
✨ dinglehopper: Add OCR-D parameter to choose TextEquiv level
2020-10-21 14:38:19 +02:00
9744fa2567
✨ dinglehopper: Add CLI option to choose TextEquiv level
2020-10-20 19:33:39 +02:00
75733039b8
🧹 dinglehopper: Do not hardcode joiner to \n
2020-10-20 18:43:56 +02:00
3848412349
✨ dinglehopper: Implement the basic text extraction from PAGE TextLines
2020-10-20 18:40:21 +02:00
f2367ac0c3
🐛 Fix OCR-D CLI for newest OCR-D
Now that find_files() is a generator, we can't use [0] to get the file.
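The underlying Python behaviour can be illustrated with a plain generator. The `find_files` below is a hypothetical stand-in, not the OCR-D function, and `next()` is only one possible way to take the first result; the commit may have solved it differently.

```python
def find_files():
    # Hypothetical stand-in that yields its results lazily.
    yield "first.xml"
    yield "second.xml"


try:
    find_files()[0]
except TypeError as error:
    # Generators cannot be indexed.
    message = str(error)

# next() with a default takes the first item, or None if there is none.
first = next(find_files(), None)
```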
2020-10-16 14:58:27 +02:00
5ed184c8c4
✨ dinglehopper: Show a progressbar on --progress
2020-10-15 16:09:54 +02:00
4951823a29
🧹 dinglehopper: Disable metrics in JSON report, too
2020-10-15 15:38:15 +02:00
82217a25bb
🧹 dinglehopper: Move all normalization code to extracted_text.py
2020-10-08 17:29:25 +02:00
c6c6b8efab
📝 dinglehopper: Add detail about the text extraction and ExtractedText
2020-10-08 17:05:36 +02:00
f50591abac
Merge branch 'feat/display-segment-id'
2020-10-08 13:39:38 +02:00
c514abfb9f
🧹 dinglehopper: Sanitize imports
2020-10-08 13:33:19 +02:00
1077dc64ce
➡️ dinglehopper: Move ExtractedText to its own file
2020-10-08 13:25:20 +02:00
9dd4ff0aae
✨ dinglehopper: Extract line IDs for ALTO
2020-10-08 12:54:28 +02:00
f3aafb6fdf
✨ dinglehopper: Validate ExtractedText.{segments,_text} in both directions
2020-10-08 12:20:27 +02:00
b14c35e147
🎨 dinglehopper: Use multimethod to handle str vs ExtractedText
2020-10-08 12:15:58 +02:00
a17ee2afec
🚧 dinglehopper: Guarantee NFC + rename from_text → from_str
2020-10-08 11:25:01 +02:00
7843824eaf
🚧 dinglehopper: Support str & ExtractedText in CER and distance functions
2020-10-08 10:47:20 +02:00
5bee55c896
💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments
2020-10-07 18:40:06 +02:00
96b55f1806
🚧 dinglehopper: Hierarchical text representation
2020-10-07 18:31:52 +02:00
d706ef4621
📝 Document CER/WER and the format detection (Fixes GH-26)
2020-09-30 17:58:05 +02:00
da47e41c85
💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments
2020-09-25 14:53:19 +02:00
7085ee0fd8
Merge pull request #29 from kba/getlogger
getLogger per method
2020-09-25 13:20:58 +02:00
77154ef256
📝 dinglehopper: Document REPORT_PREFIX (Closes GH-27.)
2020-09-24 20:58:15 +02:00
Konstantin Baierer
12da98e477
getLogger per method
2020-09-24 10:16:52 +02:00
Konstantin Baierer
004ae298ca
ocrd cli: use make_file_id and assert_file_grp_cardinality
2020-08-07 18:00:33 +02:00
6ab38f1bda
🎨 dinglehopper: Make PyCharm happier with the type hinting, newlines etc.
2020-06-18 13:27:59 +02:00
d484810038
✨ dinglehopper: Validate read segment ids
2020-06-18 13:27:59 +02:00
d39f74f11a
🧹 dinglehopper: Remove obsolete normalization-related FIXME
2020-06-18 13:27:59 +02:00
8c5f7c73d5
🧹 dinglehopper: Replace XXX with an actual comment
2020-06-18 13:27:59 +02:00
37edc0336f
🧹 dinglehopper: Remove obsolete XXX that has a GitHub issue
2020-06-18 13:27:59 +02:00
9f05e6ca4c
🧹 dinglehopper: Remove obsolete XXX about None ids
2020-06-18 13:27:59 +02:00
4469af62c8
🎨 dinglehopper: Unfuck substitutions a bit
2020-06-18 13:27:59 +02:00
079be203bd
🐛 dinglehopper: Fix tests to deal with new normalization logic
2020-06-18 13:27:59 +02:00
c010a7f05e
🧹 dinglehopper: Calculate segment ids once, on the first call
2020-06-18 13:27:59 +02:00
0cf7ff4721
🧹 dinglehopper: Remove obsolete XXX about the PAGE hierarchy
2020-06-18 13:27:59 +02:00
c432cb505a
🧹 dinglehopper: Clean up test_lines_similar()
2020-06-18 13:27:59 +02:00
0c33e84415
📓 dinglehopper: Document editops()
2020-06-18 13:27:59 +02:00
a61c935624
🧹 dinglehopper: Move Python 3.5 XXXs to a GitHub issue
See https://github.com/qurator-spk/dinglehopper/issues/20 .
2020-06-18 13:27:59 +02:00
257e4986cc
🚧 dinglehopper: Use a Bootstrap tooltip for the segment id
2020-06-18 13:27:59 +02:00
a320d5fd8f
🚧 dinglehopper: Re-introduce "substitute_equivalences" as Normalization.NFC_SBB
2020-06-18 13:27:59 +02:00