Commit Graph

94 Commits (0804b029c4e1c953d69c0f4fce3ecfecdd54062f)

Author SHA1 Message Date
Robert Sachunsky a60c14351e 1 more update for core's getLogger context
Benjamin Rosemann c02569b41e Fix f-strings for Python 3.5
Benjamin Rosemann 7b27b2834e More complex sorting for text extraction
When extracting text from TextEquiv nodes we may encounter nodes without
index or nodes that should get sorted via the conf attribute.

Therefore we added a more complex algorithm to extract a TextEquiv and
inform the user via log messages if we encounter structures that we can
handle but may produce unexpected results.
Benjamin Rosemann 6ff831dfd2 Sort textlines with missing indices
Python's built-in `sorted` function raises a TypeError when the list to
sort mixes `None` and integers:

```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we are using `float('inf')` instead of `None` in case of
missing textline indices.
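
A minimal sketch of this workaround, assuming hypothetical `(index, text)`
pairs rather than the actual dinglehopper data structures. The helper
`sort_key` is an illustrative name, not taken from the commit:

```python
def sort_key(index):
    """Map a missing index to infinity so it sorts after all real indices."""
    return float('inf') if index is None else index


# Hypothetical (index, text) pairs; not the actual dinglehopper structures.
lines = [(2, "second"), (None, "unindexed"), (1, "first")]

# Sorting on the mapped key avoids comparing None with an int.
ordered = sorted(lines, key=lambda line: sort_key(line[0]))
texts = [text for _, text in ordered]
```

Because `float('inf')` compares greater than any integer, lines without an
index end up last, and `sorted` being stable keeps several unindexed lines
in their original relative order.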
Gerber, Mike 5cbe148741 🐛 dinglehopper: Skip pages if there is no GT nor OCR (Fixes GH-34)
Gerber, Mike e4e2777cb7 🐛 dinglehopper: Do try to get text when no TextEquivs exist
Gerber, Mike 1c88891a98 ✔️ Add test data for LAREX's indexed TextEquivs (unused)
Gerber, Mike 19d15e3ecc 🐛 dinglehopper: Honor TextEquiv index (Closes GH-33)
Gerber, Mike f626a2ebe6 🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder
Gerber, Mike 8b4ee20a40 Add a new CLI tool dinglehopper-extract to just give the extracted text
Gerber, Mike b23b75b601 dinglehopper: Give segment ids from the extracted textequiv_level
Gerber, Mike b23e4ce30e dinglehopper: Add OCR-D parameter to choose TextEquiv level
Gerber, Mike 9744fa2567 dinglehopper: Add CLI option to choose TextEquiv level
Gerber, Mike 75733039b8 🧹 dinglehopper: Do not hardcode joiner to \n
Gerber, Mike 3848412349 dinglehopper: Implement the basic text extraction from PAGE TextLines
Gerber, Mike f2367ac0c3 🐛 Fix OCR-D CLI for newest OCR-D
Now that find_files() is a generator, we can't use [0] to get the file.
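
The underlying Python behaviour can be sketched as follows. The `find_files`
below is a hypothetical stand-in generator, not the OCR-D API, and the commit
may have used a different fix than the two shown here:

```python
def find_files():
    """Hypothetical stand-in that yields file ids lazily, like a generator API."""
    yield "FILE_0001"
    yield "FILE_0002"


# Generators do not support subscripting, so [0] raises a TypeError.
try:
    find_files()[0]
    subscript_failed = False
except TypeError:
    subscript_failed = True

# Two common ways to get the first item from a generator:
first = next(find_files())              # consume only the first item
first_via_list = list(find_files())[0]  # materialize everything, then index
```

`next()` avoids building the whole list, but raises `StopIteration` when the
generator is empty, so passing a default (`next(gen, None)`) may be preferable
when no file might match.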
Gerber, Mike 5ed184c8c4 dinglehopper: Show a progressbar on --progress
Gerber, Mike 4951823a29 🧹 dinglehopper: Disable metrics in JSON report, too
Gerber, Mike 82217a25bb 🧹 dinglehopper: Move all normalization code to extracted_text.py
Gerber, Mike c6c6b8efab 📝 dinglehopper: Add detail about the text extraction and ExtractedText
Gerber, Mike f50591abac Merge branch 'feat/display-segment-id'
Gerber, Mike c514abfb9f 🧹 dinglehopper: Sanitize imports
Gerber, Mike 1077dc64ce ➡️ dinglehopper: Move ExtractedText to its own file
Gerber, Mike 9dd4ff0aae dinglehopper: Extract line IDs for ALTO
Gerber, Mike f3aafb6fdf dinglehopper: Validate ExtractedText.{segments,_text} in both directions
Gerber, Mike b14c35e147 🎨 dinglehopper: Use multimethod to handle str vs ExtractedText
Gerber, Mike a17ee2afec 🚧 dinglehopper: Guarantee NFC + rename from_text → from_str
Gerber, Mike 7843824eaf 🚧 dinglehopper: Support str & ExtractedText in CER and distance functions
Gerber, Mike 5bee55c896 💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments
Gerber, Mike 96b55f1806 🚧 dinglehopper: Hierarchical text representation
Gerber, Mike d706ef4621 📝 Document CER/WER and the format detection (Fixes GH-26)
Gerber, Mike da47e41c85 💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments
Mike Gerber 7085ee0fd8 Merge pull request from kba/getlogger
getLogger per method
Gerber, Mike 77154ef256 📝 dinglehopper: Document REPORT_PREFIX (Closes GH-27.)
Konstantin Baierer 12da98e477 getLogger per method
Konstantin Baierer 004ae298ca ocrd cli: use make_file_id and assert_file_grp_cardinality
Gerber, Mike 6ab38f1bda 🎨 dinglehopper: Make PyCharm happier with the type hinting, newlines etc.
Gerber, Mike d484810038 dinglehopper: Validate read segment ids
Gerber, Mike d39f74f11a 🧹 dinglehopper: Remove obsolete normalization-related FIXME
Gerber, Mike 8c5f7c73d5 🧹 dinglehopper: Replace XXX with an actual comment
Gerber, Mike 37edc0336f 🧹 dinglehopper: Remove obsolete XXX that has a GitHub issue
Gerber, Mike 9f05e6ca4c 🧹 dinglehopper: Remove obsolete XXX about None ids
Gerber, Mike 4469af62c8 🎨 dinglehopper: Unfuck substitutions a bit
Gerber, Mike 079be203bd 🐛 dinglehopper: Fix tests to deal with new normalization logic
Gerber, Mike c010a7f05e 🧹 dinglehopper: Calculate segment ids once, on the first call
Gerber, Mike 0cf7ff4721 🧹 dinglehopper: Remove obsolete XXX about the PAGE hierarchy
Gerber, Mike c432cb505a 🧹 dinglehopper: Clean up test_lines_similar()
Gerber, Mike 0c33e84415 📓 dinglehopper: Document editops()
Gerber, Mike a61c935624 🧹 dinglehopper: Move Python 3.5 XXXs to a GitHub issue
See https://github.com/qurator-spk/dinglehopper/issues/20.
Gerber, Mike 257e4986cc 🚧 dinglehopper: Use a Bootstrap tooltip for the segment id