Benjamin Rosemann
7b27b2834e
More complex sorting for text extraction
...
When extracting text from TextEquiv nodes, we may encounter nodes without
an index or nodes that should be sorted via the conf attribute.
Therefore we added a more complex algorithm to extract a TextEquiv and
inform the user via log messages when we encounter structures that we can
handle but that may produce unexpected results.
4 years ago
Benjamin Rosemann
6ff831dfd2
Sort textlines with missing indices
...
Python's `sorted` function fails with a TypeError when the input mixes
`None` and integers:
```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we are using `float('inf')` instead of `None` in case of
missing textline indices.
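
A minimal sketch of this workaround, assuming the indices are held in a plain list. The helper name `sort_key` is hypothetical and not taken from the dinglehopper code:

```python
def sort_key(index):
    """Map a missing index (None) to infinity so that it sorts last."""
    return float("inf") if index is None else index


indices = [2, None, 0, 1]

# Comparing ints with float('inf') is well defined, so no TypeError is raised.
sorted_indices = sorted(indices, key=sort_key)
print(sorted_indices)  # [0, 1, 2, None]
```

Because `sorted` is stable, several entries without an index would keep their original relative order at the end of the list.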
4 years ago
Gerber, Mike
5cbe148741
🐛 dinglehopper: Skip pages if there is neither GT nor OCR (Fixes GH-34)
4 years ago
Gerber, Mike
e4e2777cb7
🐛 dinglehopper: Do try to get text when no TextEquivs exist
4 years ago
Gerber, Mike
1c88891a98
✔️ Add test data for LAREX's indexed TextEquivs (unused)
4 years ago
Gerber, Mike
19d15e3ecc
🐛 dinglehopper: Honor TextEquiv index (Closes GH-33)
4 years ago
Gerber, Mike
f626a2ebe6
🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder
4 years ago
Gerber, Mike
8b4ee20a40
✨ Add a new CLI tool dinglehopper-extract to just give the extracted text
4 years ago
Gerber, Mike
b23b75b601
✨ dinglehopper: Give segment ids from the extracted textequiv_level
4 years ago
Gerber, Mike
b23e4ce30e
✨ dinglehopper: Add OCR-D parameter to choose TextEquiv level
4 years ago
Gerber, Mike
9744fa2567
✨ dinglehopper: Add CLI option to choose TextEquiv level
4 years ago
Gerber, Mike
75733039b8
🧹 dinglehopper: Do not hardcode joiner to \n
4 years ago
Gerber, Mike
3848412349
✨ dinglehopper: Implement the basic text extraction from PAGE TextLines
4 years ago
Gerber, Mike
f2367ac0c3
🐛 Fix OCR-D CLI for newest OCR-D
...
Now that find_files() is a generator, we can't use [0] to get the file.
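
The underlying Python behaviour can be illustrated as follows. `find_files_stub` is a hypothetical stand-in for a generator-based lookup, not the OCR-D API, and the file names are made up:

```python
def find_files_stub():
    """Hypothetical generator standing in for a generator-based file lookup."""
    yield from ["page_0001.xml", "page_0002.xml"]


# Generators cannot be indexed; this is the failure the commit works around.
try:
    find_files_stub()[0]
except TypeError as error:
    indexing_error = str(error)

# next() retrieves the first item without materializing the whole sequence.
first_file = next(find_files_stub())

# A default avoids StopIteration when nothing is found.
first_or_none = next(iter(()), None)
```

Wrapping the generator in `list(...)` and indexing the result would also work, at the cost of consuming every item first.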
4 years ago
Gerber, Mike
5ed184c8c4
✨ dinglehopper: Show a progressbar on --progress
4 years ago
Gerber, Mike
4951823a29
🧹 dinglehopper: Disable metrics in JSON report, too
4 years ago
Gerber, Mike
82217a25bb
🧹 dinglehopper: Move all normalization code to extracted_text.py
4 years ago
Gerber, Mike
c6c6b8efab
📝 dinglehopper: Add detail about the text extraction and ExtractedText
4 years ago
Gerber, Mike
f50591abac
Merge branch 'feat/display-segment-id'
4 years ago
Gerber, Mike
c514abfb9f
🧹 dinglehopper: Sanitize imports
4 years ago
Gerber, Mike
1077dc64ce
➡️ dinglehopper: Move ExtractedText to its own file
4 years ago
Gerber, Mike
9dd4ff0aae
✨ dinglehopper: Extract line IDs for ALTO
4 years ago
Gerber, Mike
f3aafb6fdf
✨ dinglehopper: Validate ExtractedText.{segments,_text} in both directions
4 years ago
Gerber, Mike
b14c35e147
🎨 dinglehopper: Use multimethod to handle str vs ExtractedText
4 years ago
Gerber, Mike
a17ee2afec
🚧 dinglehopper: Guarantee NFC + rename from_text → from_str
4 years ago
Gerber, Mike
7843824eaf
🚧 dinglehopper: Support str & ExtractedText in CER and distance functions
4 years ago
Gerber, Mike
5bee55c896
💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments
4 years ago
Gerber, Mike
96b55f1806
🚧 dinglehopper: Hierarchical text representation
4 years ago
Gerber, Mike
d706ef4621
📝 Document CER/WER and the format detection (Fixes GH-26)
4 years ago
Gerber, Mike
da47e41c85
💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments
4 years ago
Mike Gerber
7085ee0fd8
Merge pull request #29 from kba/getlogger
...
getLogger per method
4 years ago
Gerber, Mike
77154ef256
📝 dinglehopper: Document REPORT_PREFIX (Closes GH-27.)
4 years ago
Konstantin Baierer
12da98e477
getLogger per method
4 years ago
Konstantin Baierer
004ae298ca
ocrd cli: use make_file_id and assert_file_grp_cardinality
4 years ago
Gerber, Mike
6ab38f1bda
🎨 dinglehopper: Make PyCharm happier with the type hinting, newlines etc.
5 years ago
Gerber, Mike
d484810038
✨ dinglehopper: Validate read segment ids
5 years ago
Gerber, Mike
d39f74f11a
🧹 dinglehopper: Remove obsolete normalization-related FIXME
5 years ago
Gerber, Mike
8c5f7c73d5
🧹 dinglehopper: Replace XXX with an actual comment
5 years ago
Gerber, Mike
37edc0336f
🧹 dinglehopper: Remove obsolete XXX that has a GitHub issue
5 years ago
Gerber, Mike
9f05e6ca4c
🧹 dinglehopper: Remove obsolete XXX about None ids
5 years ago
Gerber, Mike
4469af62c8
🎨 dinglehopper: Unfuck substitutions a bit
5 years ago
Gerber, Mike
079be203bd
🐛 dinglehopper: Fix tests to deal with new normalization logic
5 years ago
Gerber, Mike
c010a7f05e
🧹 dinglehopper: Calculate segment ids once, on the first call
5 years ago
Gerber, Mike
0cf7ff4721
🧹 dinglehopper: Remove obsolete XXX about the PAGE hierarchy
5 years ago
Gerber, Mike
c432cb505a
🧹 dinglehopper: Clean up test_lines_similar()
5 years ago
Gerber, Mike
0c33e84415
📓 dinglehopper: Document editops()
5 years ago
Gerber, Mike
a61c935624
🧹 dinglehopper: Move Python 3.5 XXXs to a GitHub issue
...
See https://github.com/qurator-spk/dinglehopper/issues/20 .
5 years ago
Gerber, Mike
257e4986cc
🚧 dinglehopper: Use a Bootstrap tooltip for the segment id
5 years ago
Gerber, Mike
a320d5fd8f
🚧 dinglehopper: Re-introduce "substitute_equivalences" as Normalization.NFC_SBB
5 years ago
Gerber, Mike
2579e0220c
🚧 dinglehopper: Remove debug output
5 years ago
Gerber, Mike
d4e39d3d26
🚧 dinglehopper: Display segment id in the corresponding column
5 years ago
Gerber, Mike
48ad340428
🚧 dinglehopper: Display segment id when hovering over a character difference
5 years ago
Gerber, Mike
1f6538b44c
🚧 dinglehopper: Extract text while retaining segment id info
5 years ago
Gerber, Mike
275ff32524
🚧 dinglehopper: Extract text while retaining segment id info
5 years ago
Gerber, Mike
4e182e0794
🚧 dinglehopper: Extract text while retaining segment id info
5 years ago
Gerber, Mike
9f8bb1d8ea
🚧 dinglehopper: Extract text while retaining segment id info
5 years ago
Gerber, Mike
668de758a0
✨ dinglehopper: Support disabling metrics in the OCR-D interface
5 years ago
Gerber, Mike
f699697eb3
🐛 dinglehopper: Fix reading OCR-D workspace files when only URLs are provided
5 years ago
Gerber, Mike
22765f02a2
🐛 dinglehopper: Fix tests by making metrics a keyword argument
5 years ago
Gerber, Mike
5cbeb7b0dd
✨ dinglehopper: Support disabling the metrics using CLI option --no-metrics
5 years ago
Gerber, Mike
745095e52c
✨ dinglehopper: Include number of characters and words in JSON report
5 years ago
Gerber, Mike
48a31ce672
Revert "Merge branch 'master' of https://github.com/qurator-spk/sbb_textline_detector "
...
This reverts commit 2c89bf3b35ee290d7b830ef270df3a96aa48245e, reversing
changes made to 9f7e413148ca5dbac9b555d7b0d0a5fa3a0f5340.
5 years ago
b-vr103
1303a7d92f
Merge branch 'master' of https://github.com/qurator-spk/sbb_textline_detector
5 years ago
Gerber, Mike
f32eb9eb69
🐛 dinglehopper: Escape text inserted into HTML (Fixes #8)
5 years ago
Gerber, Mike
82e863fac2
📝 dinglehopper: Document seq_editops()
5 years ago
Gerber, Mike
5ccdace1dd
🎨 dinglehopper: Move working_directory() context manager into tests/util
5 years ago
Gerber, Mike
f98c527c93
🐛 dinglehopper: Fix working_directory() context manager
5 years ago
Gerber, Mike
5273d10bac
🐛 dinglehopper: Generate a loadable JSON report even if CER=∞
5 years ago
Gerber, Mike
ced6504ad0
🎨 dinglehopper: Expose clearing the Levenshtein cache as a function
5 years ago
Gerber, Mike
5cf4eddaeb
⚡ dinglehopper: Clear Levenshtein cache between OCR-D files
5 years ago
Gerber, Mike
58ff140bc0
⚡ dinglehopper: Improve performance by caching the Levenshtein matrix
...
Motivated by [a pull
request](https://github.com/qurator-spk/dinglehopper/pull/7) by
@JKamlah, implement a cache of the Levenshtein matrix calculation.
Previously, we calculated the Levenshtein matrices for characters and
words twice: once for the error rates, once for the alignment.
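
A minimal sketch of the caching idea, assuming `functools.lru_cache` and hashable inputs (strings or tuples of words). The function names and the cache size are illustrative assumptions, not necessarily what dinglehopper uses:

```python
from functools import lru_cache


@lru_cache(maxsize=10)  # assumed size, chosen only for illustration
def levenshtein_matrix(seq1, seq2):
    """Compute the full Levenshtein matrix for two hashable sequences."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        matrix[i][0] = i  # delete all of seq1[:i]
    for j in range(cols):
        matrix[0][j] = j  # insert all of seq2[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if seq1[i - 1] == seq2[j - 1] else 1
            matrix[i][j] = min(
                matrix[i - 1][j] + 1,         # deletion
                matrix[i][j - 1] + 1,         # insertion
                matrix[i - 1][j - 1] + cost,  # substitution or match
            )
    return matrix


def distance(seq1, seq2):
    """Edit distance is the bottom-right cell of the matrix."""
    return levenshtein_matrix(seq1, seq2)[-1][-1]


distance("kitten", "sitting")            # first call computes the matrix
levenshtein_matrix("kitten", "sitting")  # second use is served from the cache
info = levenshtein_matrix.cache_info()
```

Callers must treat the cached matrix as read-only, since every hit returns the same list object. `levenshtein_matrix.cache_clear()` empties the cache, for example between unrelated documents.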
5 years ago
Gerber, Mike
11a6341641
🧹 dinglehopper: Remove broken implementation of the unordered word error rate
5 years ago
Gerber, Mike
f22228840e
🧹 dinglehopper: Use exclusively relative imports in tests
5 years ago
Gerber, Mike
d61c076aad
🧹 dinglehopper: Remove debug print()s
5 years ago
Gerber, Mike
12a48f3bfe
✅ dinglehopper: Test aligning lists of lines
5 years ago
Gerber, Mike
680c2a2661
🐛 dinglehopper: Fix test_ocrd_cli for Python 3.5, again, and again
5 years ago
Gerber, Mike
7cf1a540f4
🐛 dinglehopper: Fix test_ocrd_cli for Python 3.5, again
5 years ago
Gerber, Mike
49e2065ad6
🐛 dinglehopper: Fix test_ocrd_cli for Python 3.5
5 years ago
Gerber, Mike
86178271df
✅ dinglehopper: Fix repeated tests for the OCR-D interface
5 years ago
Gerber, Mike
b6f50ef853
✅ dinglehopper: Add a test for the OCR-D interface
5 years ago
Konstantin Baierer
2ca44af31d
ocrd-tool: add category
5 years ago
Gerber, Mike
c30553985f
dinglehopper: Substitute more characters
5 years ago
Gerber, Mike
493541fddf
🐛 dinglehopper: Always work with NFC text
5 years ago
Gerber, Mike
df93c80e5d
🐛 dinglehopper: Always work with NFC text
5 years ago
Gerber, Mike
715b813bbc
dinglehopper: Add two more eMOP ligatures
5 years ago
Gerber, Mike
8d055e7b6e
🐛 dinglehopper: Work on NFC'ed grapheme clusters when aligning text
5 years ago
Gerber, Mike
534958be1d
🐛 dinglehopper: Fix sorting the reading order
...
Regions were sorted incorrectly when there were more than 9 regions in an
OrderedGroup because the index was sorted alphabetically, not
numerically. Fix this by converting the index to integers.
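
The difference is easy to reproduce with plain strings. This is a sketch that assumes the indices arrive as strings, as XML attribute values do:

```python
# Indices as they might be read from XML attributes: strings, not integers.
indices = [str(i) for i in range(12)]

# String comparison puts '10' and '11' before '2'.
alphabetical = sorted(indices)

# Converting to int in the sort key restores the intended numeric order.
numerical = sorted(indices, key=int)

print(alphabetical[:4])  # ['0', '1', '10', '11']
print(numerical[-3:])    # ['9', '10', '11']
```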
5 years ago
Gerber, Mike
10f010eaa8
🐛 dinglehopper: Do not throw error if a region ID is not found
...
The ReadingOrder might contain regions of types other than text regions,
so not finding a TextRegion with the referenced ID is not an error.
Downgrade to a warning for now.
5 years ago
Gerber, Mike
8237b3edaf
dinglehopper: Substitute more characters
5 years ago
Gerber, Mike
02a0e093bf
✨ dinglehopper: Add OCR-D interface
5 years ago
Gerber, Mike
495919c06d
🧹 dinglehopper: Move pytest.ini
5 years ago
Gerber, Mike
89048bf55d
➡ Move dinglehopper into its own directory
5 years ago