Benjamin Rosemann
c02569b41e
Fix f-strings for Python 3.5
4 years ago
Benjamin Rosemann
7b27b2834e
More complex sorting for text extraction
...
When extracting text from TextEquiv nodes we may encounter nodes without an
index or nodes that should be sorted by the conf attribute.
Therefore we added a more complex algorithm to extract a TextEquiv and
inform the user via log messages if we encounter structures that we can
handle but that may produce unexpected results.
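
A rough sketch of such a selection, not the code from this commit: candidates are modelled as plain dicts with optional `index` and `conf` keys, and the precedence (lowest index, then highest conf, then document order) is an assumption made for illustration. The name `pick_text_equiv` is hypothetical.

```python
import logging

log = logging.getLogger("textequiv_sketch")


def pick_text_equiv(text_equivs):
    """Pick one candidate from dicts with optional 'index' and 'conf' keys.

    Assumed precedence (illustrative only): lowest index first,
    then highest conf, then the first candidate in document order.
    """
    if not text_equivs:
        return None

    indexed = [t for t in text_equivs if t.get("index") is not None]
    if indexed:
        if len(indexed) < len(text_equivs):
            log.warning("Some TextEquivs have no index; ignoring them.")
        return min(indexed, key=lambda t: t["index"])

    with_conf = [t for t in text_equivs if t.get("conf") is not None]
    if with_conf:
        log.warning("No index found; falling back to the highest conf.")
        return max(with_conf, key=lambda t: t["conf"])

    if len(text_equivs) > 1:
        log.warning("No index or conf; using the first TextEquiv.")
    return text_equivs[0]
```

Logging a warning for each fallback keeps the result usable while telling the user that the input was ambiguous.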
4 years ago
Benjamin Rosemann
6ff831dfd2
Sort textlines with missing indices
...
Python's `sorted` function fails with a TypeError when called on a list
that mixes `None` and integers:
```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we are using `float('inf')` instead of `None` in case of
missing textline indices.
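
A minimal sketch of the idea, assuming hypothetical `(index, text)` pairs instead of the real textline objects:

```python
# Hypothetical data: (index, text) pairs; None marks a missing index.
lines = [(None, "no index"), (1, "second"), (0, "first")]


def sort_key(line):
    index, _ = line
    # float('inf') compares greater than any int, so lines
    # without an index sort after all indexed lines.
    return float("inf") if index is None else index


ordered = sorted(lines, key=sort_key)
print([text for _, text in ordered])  # ['first', 'second', 'no index']
```

Because `sorted` is stable, several lines with a missing index keep their original relative order.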
4 years ago
Gerber, Mike
5cbe148741
🐛 dinglehopper: Skip pages if there is no GT nor OCR (Fixes GH-34)
4 years ago
Gerber, Mike
e4e2777cb7
🐛 dinglehopper: Do try to get text when no TextEquivs exist
4 years ago
Gerber, Mike
1c88891a98
✔️ Add test data for LAREX's indexed TextEquivs (unused)
4 years ago
Gerber, Mike
19d15e3ecc
🐛 dinglehopper: Honor TextEquiv index (Closes GH-33)
4 years ago
Gerber, Mike
f626a2ebe6
🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder
4 years ago
Gerber, Mike
8b4ee20a40
✨ Add a new CLI tool dinglehopper-extract to just give the extracted text
4 years ago
Gerber, Mike
b23b75b601
✨ dinglehopper: Give segment ids from the extracted textequiv_level
4 years ago
Gerber, Mike
b23e4ce30e
✨ dinglehopper: Add OCR-D parameter to choose TextEquiv level
4 years ago
Gerber, Mike
9744fa2567
✨ dinglehopper: Add CLI option to choose TextEquiv level
4 years ago
Gerber, Mike
75733039b8
🧹 dinglehopper: Do not hardcode joiner to \n
4 years ago
Gerber, Mike
3848412349
✨ dinglehopper: Implement the basic text extraction from PAGE TextLines
4 years ago
Gerber, Mike
f2367ac0c3
🐛 Fix OCR-D CLI for newest OCR-D
...
Now that find_files() is a generator, we can't use [0] to get the file.
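
The general pattern can be sketched as follows. `find_files_stub` is a hypothetical stand-in for a lookup that returns a generator, not the OCR-D API itself:

```python
def find_files_stub():
    # Hypothetical stand-in for a lookup that yields results lazily.
    yield "file_a.xml"
    yield "file_b.xml"


# Subscripting a generator, e.g. find_files_stub()[0], raises a TypeError.
# next() takes the first item; the default avoids StopIteration when
# nothing is yielded.
first = next(find_files_stub(), None)
print(first)  # file_a.xml
```

Wrapping the call in `list(...)` would also work, at the cost of materializing every result.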
4 years ago
Gerber, Mike
5ed184c8c4
✨ dinglehopper: Show a progressbar on --progress
4 years ago
Gerber, Mike
4951823a29
🧹 dinglehopper: Disable metrics in JSON report, too
4 years ago
Gerber, Mike
82217a25bb
🧹 dinglehopper: Move all normalization code to extracted_text.py
4 years ago
Gerber, Mike
c6c6b8efab
📝 dinglehopper: Add detail about the text extraction and ExtractedText
4 years ago
Gerber, Mike
f50591abac
Merge branch 'feat/display-segment-id'
4 years ago
Gerber, Mike
c514abfb9f
🧹 dinglehopper: Sanitize imports
4 years ago
Gerber, Mike
1077dc64ce
➡️ dinglehopper: Move ExtractedText to its own file
4 years ago
Gerber, Mike
9dd4ff0aae
✨ dinglehopper: Extract line IDs for ALTO
4 years ago
Gerber, Mike
f3aafb6fdf
✨ dinglehopper: Validate ExtractedText.{segments,_text} in both directions
4 years ago
Gerber, Mike
b14c35e147
🎨 dinglehopper: Use multimethod to handle str vs ExtractedText
4 years ago
Gerber, Mike
a17ee2afec
🚧 dinglehopper: Guarantee NFC + rename from_text → from_str
4 years ago
Gerber, Mike
7843824eaf
🚧 dinglehopper: Support str & ExtractedText in CER and distance functions
4 years ago
Gerber, Mike
5bee55c896
💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments
4 years ago
Gerber, Mike
96b55f1806
🚧 dinglehopper: Hierarchical text representation
4 years ago
Gerber, Mike
d706ef4621
📝 Document CER/WER and the format detection (Fixes GH-26)
4 years ago
Gerber, Mike
da47e41c85
💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments
4 years ago
Mike Gerber
7085ee0fd8
Merge pull request #29 from kba/getlogger
...
getLogger per method
4 years ago
Gerber, Mike
77154ef256
📝 dinglehopper: Document REPORT_PREFIX (Closes GH-27.)
4 years ago
Konstantin Baierer
12da98e477
getLogger per method
4 years ago
Konstantin Baierer
004ae298ca
ocrd cli: use make_file_id and assert_file_grp_cardinality
4 years ago
Gerber, Mike
6ab38f1bda
🎨 dinglehopper: Make PyCharm happier with the type hinting, newlines etc.
5 years ago
Gerber, Mike
d484810038
✨ dinglehopper: Validate read segment ids
5 years ago
Gerber, Mike
d39f74f11a
🧹 dinglehopper: Remove obsolete normalization-related FIXME
5 years ago
Gerber, Mike
8c5f7c73d5
🧹 dinglehopper: Replace XXX with an actual comment
5 years ago
Gerber, Mike
37edc0336f
🧹 dinglehopper: Remove obsolete XXX that has a GitHub issue
5 years ago
Gerber, Mike
9f05e6ca4c
🧹 dinglehopper: Remove obsolete XXX about None ids
5 years ago
Gerber, Mike
4469af62c8
🎨 dinglehopper: Unfuck substitutions a bit
5 years ago
Gerber, Mike
079be203bd
🐛 dinglehopper: Fix tests to deal with new normalization logic
5 years ago
Gerber, Mike
c010a7f05e
🧹 dinglehopper: Calculate segment ids once, on the first call
5 years ago
Gerber, Mike
0cf7ff4721
🧹 dinglehopper: Remove obsolete XXX about the PAGE hierarchy
5 years ago
Gerber, Mike
c432cb505a
🧹 dinglehopper: Clean up test_lines_similar()
5 years ago
Gerber, Mike
0c33e84415
📓 dinglehopper: Document editops()
5 years ago
Gerber, Mike
a61c935624
🧹 dinglehopper: Move Python 3.5 XXXs to a GitHub issue
...
See https://github.com/qurator-spk/dinglehopper/issues/20 .
5 years ago
Gerber, Mike
257e4986cc
🚧 dinglehopper: Use a Bootstrap tooltip for the segment id
5 years ago
Gerber, Mike
a320d5fd8f
🚧 dinglehopper: Re-introduce "substitute_equivalences" as Normalization.NFC_SBB
5 years ago