Benjamin Rosemann
6ff831dfd2
Sort textlines with missing indices
...
Python's built-in `sorted` function fails with a TypeError when the
list mixes `None` and integers:
```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we use `float('inf')` instead of `None` for missing textline
indices, so that lines without an index sort after all indexed lines.
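
A minimal sketch of the idea, not the actual dinglehopper code: the helper
`index_key` and the sample `lines` data are hypothetical and only illustrate
how replacing `None` with `float('inf')` keeps the sort working.
```python
def index_key(index):
    """Map a missing index to infinity so it compares greater than any integer."""
    return float("inf") if index is None else index


# Hypothetical (index, text) pairs standing in for text lines
lines = [(2, "second"), (None, "unindexed"), (1, "first")]

# Sorting by the substituted key no longer compares None with int
ordered = sorted(lines, key=lambda line: index_key(line[0]))
texts = [text for _, text in ordered]
```
With this key, `texts` lists the indexed lines in order, followed by the
line without an index.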
2020-10-29 10:03:40 +01:00
5cbe148741
🐛 dinglehopper: Skip pages if there is no GT nor OCR (Fixes GH-34)
2020-10-21 19:29:45 +02:00
e4e2777cb7
🐛 dinglehopper: Do try to get text when no TextEquivs exist
2020-10-21 17:59:44 +02:00
1c88891a98
✔️ Add test data for LAREX's indexed TextEquivs (unused)
2020-10-21 17:51:15 +02:00
19d15e3ecc
🐛 dinglehopper: Honor TextEquiv index (Closes GH-33)
2020-10-21 17:50:21 +02:00
f626a2ebe6
🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder
2020-10-21 17:03:55 +02:00
8b4ee20a40
✨ Add a new CLI tool dinglehopper-extract to just give the extracted text
2020-10-21 16:30:48 +02:00
b23b75b601
✨ dinglehopper: Give segment ids from the extracted textequiv_level
2020-10-21 16:04:33 +02:00
b23e4ce30e
✨ dinglehopper: Add OCR-D parameter to choose TextEquiv level
2020-10-21 14:38:19 +02:00
9744fa2567
✨ dinglehopper: Add CLI option to choose TextEquiv level
2020-10-20 19:33:39 +02:00
75733039b8
🧹 dinglehopper: Do not hardcode joiner to \n
2020-10-20 18:43:56 +02:00
3848412349
✨ dinglehopper: Implement the basic text extraction from PAGE TextLines
2020-10-20 18:40:21 +02:00
f2367ac0c3
🐛 Fix OCR-D CLI for newest OCR-D
...
Now that find_files() is a generator, we can't use [0] to get the file.
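
To illustrate the problem in general terms (this is not the OCR-D API;
`find_files_sketch` is a hypothetical stand-in): a generator cannot be
subscripted, and one common way to take its first item is `next()`. The
fix actually applied in this commit may differ.
```python
def find_files_sketch():
    """Hypothetical stand-in for a lookup that yields its results lazily."""
    yield "gt.xml"
    yield "ocr.xml"


# Subscripting a generator raises a TypeError
try:
    find_files_sketch()[0]
    subscript_error = None
except TypeError as exc:
    subscript_error = exc

# next() returns the first yielded item instead
first_file = next(find_files_sketch())
```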
2020-10-16 14:58:27 +02:00
5ed184c8c4
✨ dinglehopper: Show a progressbar on --progress
2020-10-15 16:09:54 +02:00
4951823a29
🧹 dinglehopper: Disable metrics in JSON report, too
2020-10-15 15:38:15 +02:00
82217a25bb
🧹 dinglehopper: Move all normalization code to extracted_text.py
2020-10-08 17:29:25 +02:00
c6c6b8efab
📝 dinglehopper: Add detail about the text extraction and ExtractedText
2020-10-08 17:05:36 +02:00
f50591abac
Merge branch 'feat/display-segment-id'
2020-10-08 13:39:38 +02:00
c514abfb9f
🧹 dinglehopper: Sanitize imports
2020-10-08 13:33:19 +02:00
1077dc64ce
➡️ dinglehopper: Move ExtractedText to its own file
2020-10-08 13:25:20 +02:00
9dd4ff0aae
✨ dinglehopper: Extract line IDs for ALTO
2020-10-08 12:54:28 +02:00
f3aafb6fdf
✨ dinglehopper: Validate ExtractedText.{segments,_text} in both directions
2020-10-08 12:20:27 +02:00
b14c35e147
🎨 dinglehopper: Use multimethod to handle str vs ExtractedText
2020-10-08 12:15:58 +02:00
a17ee2afec
🚧 dinglehopper: Guarantee NFC + rename from_text → from_str
2020-10-08 11:25:01 +02:00
7843824eaf
🚧 dinglehopper: Support str & ExtractedText in CER and distance functions
2020-10-08 10:47:20 +02:00
5bee55c896
💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments
2020-10-07 18:40:06 +02:00
96b55f1806
🚧 dinglehopper: Hierarchical text representation
2020-10-07 18:31:52 +02:00
d706ef4621
📝 Document CER/WER and the format detection (Fixes GH-26)
2020-09-30 17:58:05 +02:00
da47e41c85
💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments
2020-09-25 14:53:19 +02:00
7085ee0fd8
Merge pull request #29 from kba/getlogger
...
getLogger per method
2020-09-25 13:20:58 +02:00
77154ef256
📝 dinglehopper: Document REPORT_PREFIX (Closes GH-27.)
2020-09-24 20:58:15 +02:00
Konstantin Baierer
12da98e477
getLogger per method
2020-09-24 10:16:52 +02:00
Konstantin Baierer
004ae298ca
ocrd cli: use make_file_id and assert_file_grp_cardinality
2020-08-07 18:00:33 +02:00
6ab38f1bda
🎨 dinglehopper: Make PyCharm happier with the type hinting, newlines etc.
2020-06-18 13:27:59 +02:00
d484810038
✨ dinglehopper: Validate read segment ids
2020-06-18 13:27:59 +02:00
d39f74f11a
🧹 dinglehopper: Remove obsolete normalization-related FIXME
2020-06-18 13:27:59 +02:00
8c5f7c73d5
🧹 dinglehopper: Replace XXX with an actual comment
2020-06-18 13:27:59 +02:00
37edc0336f
🧹 dinglehopper: Remove obsolete XXX that has a GitHub issue
2020-06-18 13:27:59 +02:00
9f05e6ca4c
🧹 dinglehopper: Remove obsolete XXX about None ids
2020-06-18 13:27:59 +02:00
4469af62c8
🎨 dinglehopper: Unfuck substitutions a bit
2020-06-18 13:27:59 +02:00
079be203bd
🐛 dinglehopper: Fix tests to deal with new normalization logic
2020-06-18 13:27:59 +02:00
c010a7f05e
🧹 dinglehopper: Calculate segment ids once, on the first call
2020-06-18 13:27:59 +02:00
0cf7ff4721
🧹 dinglehopper: Remove obsolete XXX about the PAGE hierarchy
2020-06-18 13:27:59 +02:00
c432cb505a
🧹 dinglehopper: Clean up test_lines_similar()
2020-06-18 13:27:59 +02:00
0c33e84415
📓 dinglehopper: Document editops()
2020-06-18 13:27:59 +02:00
a61c935624
🧹 dinglehopper: Move Python 3.5 XXXs to a GitHub issue
...
See https://github.com/qurator-spk/dinglehopper/issues/20 .
2020-06-18 13:27:59 +02:00
257e4986cc
🚧 dinglehopper: Use a Bootstrap tooltip for the segment id
2020-06-18 13:27:59 +02:00
a320d5fd8f
🚧 dinglehopper: Re-introduce "substitute_equivalences" as Normalization.NFC_SBB
2020-06-18 13:27:59 +02:00
2579e0220c
🚧 dinglehopper: Remove debug output
2020-06-18 13:27:59 +02:00
d4e39d3d26
🚧 dinglehopper: Display segment id in the corresponding column
2020-06-18 13:27:59 +02:00