Commit Graph

142 Commits (de6cd8f1e7b97c27a9aeca878797d0491d8f1872)

Author SHA1 Message Date
Benjamin Rosemann 7b27b2834e More complex sorting for text extraction
When extracting text from TextEquiv nodes we may encounter nodes without
an index or nodes that should be sorted via the conf attribute.

Therefore we added a more complex algorithm to extract a TextEquiv and
to inform the user via log messages if we encounter structures that we
can handle but that may produce unexpected results.
4 years ago
Benjamin Rosemann 6ff831dfd2 Sort textlines with missing indices
Python's `sorted` function raises a TypeError when the list mixes
`None` and integers:

```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we are using `float('inf')` instead of `None` in case of
missing textline indices.
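
A minimal sketch of the idea, not the actual dinglehopper code: the
names `sort_key` and `lines` are hypothetical and only illustrate how
replacing `None` with `float('inf')` keeps `sorted` working and places
entries without an index last.

```python
def sort_key(index):
    # Hypothetical helper: a missing index (None) becomes +infinity,
    # which compares greater than any integer index.
    return float("inf") if index is None else index


# (text, index) pairs; the second entry has no index.
lines = [("c", None), ("b", 2), ("a", 1)]

ordered = sorted(lines, key=lambda item: sort_key(item[1]))
texts = [text for text, _ in ordered]
print(texts)
```

Entries with an integer index are ordered numerically; entries without
one end up at the end instead of raising a TypeError.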
4 years ago
Gerber, Mike 5cbe148741 🐛 dinglehopper: Skip pages if there is no GT nor OCR (Fixes GH-34) 4 years ago
Gerber, Mike e4e2777cb7 🐛 dinglehopper: Do try to get text when no TextEquivs exist 4 years ago
Gerber, Mike 1c88891a98 ✔️ Add test data for LAREX's indexed TextEquivs (unused) 4 years ago
Gerber, Mike 19d15e3ecc 🐛 dinglehopper: Honor TextEquiv index (Closes GH-33) 4 years ago
Gerber, Mike f626a2ebe6 🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder 4 years ago
Gerber, Mike 8b4ee20a40 Add a new CLI tool dinglehopper-extract to just give the extracted text 4 years ago
Gerber, Mike b23b75b601 dinglehopper: Give segment ids from the extracted textequiv_level 4 years ago
Gerber, Mike b23e4ce30e dinglehopper: Add OCR-D parameter to choose TextEquiv level 4 years ago
Gerber, Mike 9744fa2567 dinglehopper: Add CLI option to choose TextEquiv level 4 years ago
Gerber, Mike 75733039b8 🧹 dinglehopper: Do not hardcode joiner to \n 4 years ago
Gerber, Mike 3848412349 dinglehopper: Implement the basic text extraction from PAGE TextLines 4 years ago
Gerber, Mike f2367ac0c3 🐛 Fix OCR-D CLI for newest OCR-D
Now that find_files() is a generator, we can't use [0] to get the file.
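
A small illustration of the problem, using a hypothetical stand-in
generator rather than the real OCR-D `find_files()`: a generator cannot
be subscripted, so the first item has to be fetched with `next()` or by
materializing a list first.

```python
def find_files():
    """Hypothetical stand-in that yields file names lazily."""
    yield from ["FILE_0001.xml", "FILE_0002.xml"]


try:
    find_files()[0]  # generators do not support indexing
    subscript_ok = True
except TypeError:
    subscript_ok = False

# Two ways to get the first result from a generator:
first = next(find_files())
also_first = list(find_files())[0]
print(subscript_ok, first, also_first)
```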
4 years ago
Gerber, Mike 5ed184c8c4 dinglehopper: Show a progressbar on --progress 4 years ago
Gerber, Mike 4951823a29 🧹 dinglehopper: Disable metrics in JSON report, too 4 years ago
Gerber, Mike 82217a25bb 🧹 dinglehopper: Move all normalization code to extracted_text.py 4 years ago
Gerber, Mike c6c6b8efab 📝 dinglehopper: Add detail about the text extraction and ExtractedText 4 years ago
Gerber, Mike f50591abac Merge branch 'feat/display-segment-id' 4 years ago
Gerber, Mike c514abfb9f 🧹 dinglehopper: Sanitize imports 4 years ago
Gerber, Mike 1077dc64ce ➡️ dinglehopper: Move ExtractedText to its own file 4 years ago
Gerber, Mike 9dd4ff0aae dinglehopper: Extract line IDs for ALTO 4 years ago
Gerber, Mike f3aafb6fdf dinglehopper: Validate ExtractedText.{segments,_text} in both directions 4 years ago
Gerber, Mike b14c35e147 🎨 dinglehopper: Use multimethod to handle str vs ExtractedText 4 years ago
Gerber, Mike a17ee2afec 🚧 dinglehopper: Guarantee NFC + rename from_text → from_str 4 years ago
Gerber, Mike 7843824eaf 🚧 dinglehopper: Support str & ExtractedText in CER and distance functions 4 years ago
Gerber, Mike 5bee55c896 💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments 4 years ago
Gerber, Mike 96b55f1806 🚧 dinglehopper: Hierarchical text representation 4 years ago
Gerber, Mike d706ef4621 📝 Document CER/WER and the format detection (Fixes GH-26) 4 years ago
Gerber, Mike da47e41c85 💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments 4 years ago
Mike Gerber 7085ee0fd8
Merge pull request #29 from kba/getlogger
getLogger per method
4 years ago
Gerber, Mike 77154ef256 📝 dinglehopper: Document REPORT_PREFIX (Closes GH-27.) 4 years ago
Konstantin Baierer 12da98e477 getLogger per method 4 years ago
Konstantin Baierer 004ae298ca ocrd cli: use make_file_id and assert_file_grp_cardinality 4 years ago
Gerber, Mike 6ab38f1bda 🎨 dinglehopper: Make PyCharm happier with the type hinting, newlines etc. 5 years ago
Gerber, Mike d484810038 dinglehopper: Validate read segment ids 5 years ago
Gerber, Mike d39f74f11a 🧹 dinglehopper: Remove obsolete normalization-related FIXME 5 years ago
Gerber, Mike 8c5f7c73d5 🧹 dinglehopper: Replace XXX with an actual comment 5 years ago
Gerber, Mike 37edc0336f 🧹 dinglehopper: Remove obsolete XXX that has a GitHub issue 5 years ago
Gerber, Mike 9f05e6ca4c 🧹 dinglehopper: Remove obsolete XXX about None ids 5 years ago
Gerber, Mike 4469af62c8 🎨 dinglehopper: Unfuck substitutions a bit 5 years ago
Gerber, Mike 079be203bd 🐛 dinglehopper: Fix tests to deal with new normalization logic 5 years ago
Gerber, Mike c010a7f05e 🧹 dinglehopper: Calculate segment ids once, on the first call 5 years ago
Gerber, Mike 0cf7ff4721 🧹 dinglehopper: Remove obsolete XXX about the PAGE hierarchy 5 years ago
Gerber, Mike c432cb505a 🧹 dinglehopper: Clean up test_lines_similar() 5 years ago
Gerber, Mike 0c33e84415 📓 dinglehopper: Document editops() 5 years ago
Gerber, Mike a61c935624 🧹 dinglehopper: Move Python 3.5 XXXs to a GitHub issue
See https://github.com/qurator-spk/dinglehopper/issues/20.
5 years ago
Gerber, Mike 257e4986cc 🚧 dinglehopper: Use a Bootstrap tooltip for the segment id 5 years ago
Gerber, Mike a320d5fd8f 🚧 dinglehopper: Re-introduce "substitute_equivalences" as Normalization.NFC_SBB 5 years ago
Gerber, Mike 2579e0220c 🚧 dinglehopper: Remove debug output 5 years ago
Gerber, Mike d4e39d3d26 🚧 dinglehopper: Display segment id in the corresponding column 5 years ago
Gerber, Mike 48ad340428 🚧 dinglehopper: Display segment id when hovering over a character difference 5 years ago
Gerber, Mike 1f6538b44c 🚧 dinglehopper: Extract text while retaining segment id info 5 years ago
Gerber, Mike 275ff32524 🚧 dinglehopper: Extract text while retaining segment id info 5 years ago
Gerber, Mike 4e182e0794 🚧 dinglehopper: Extract text while retaining segment id info 5 years ago
Gerber, Mike 9f8bb1d8ea 🚧 dinglehopper: Extract text while retaining segment id info 5 years ago
Gerber, Mike 668de758a0 dinglehopper: Support disabling metrics in the OCR-D interface 5 years ago
Gerber, Mike f699697eb3 🐛 dinglehopper: Fix reading OCR-D workspace files when only URLs are provided 5 years ago
Gerber, Mike 22765f02a2 🐛 dinglehopper: Fix tests by making metrics a keyword argument 5 years ago
Gerber, Mike 5cbeb7b0dd dinglehopper: Support disabling the metrics using CLI option --no-metrics 5 years ago
Gerber, Mike 745095e52c dinglehopper: Include number of characters and words in JSON report 5 years ago
Gerber, Mike 48a31ce672 Revert "Merge branch 'master' of https://github.com/qurator-spk/sbb_textline_detector"
This reverts commit 2c89bf3b35ee290d7b830ef270df3a96aa48245e, reversing
changes made to 9f7e413148ca5dbac9b555d7b0d0a5fa3a0f5340.
5 years ago
b-vr103 1303a7d92f Merge branch 'master' of https://github.com/qurator-spk/sbb_textline_detector 5 years ago
Gerber, Mike f32eb9eb69 🐛 dinglehopper: Escape text inserted into HTML (Fixes #8) 5 years ago
Gerber, Mike 82e863fac2 📝 dinglehopper: Document seq_editops() 5 years ago
Gerber, Mike 5ccdace1dd 🎨 dinglehopper: Move working_directory() context manager into tests/util 5 years ago
Gerber, Mike f98c527c93 🐛 dinglehopper: Fix working_directory() context manager 5 years ago
Gerber, Mike 5273d10bac 🐛 dinglehopper: Generate a loadable JSON report even if CER=∞ 5 years ago
Gerber, Mike ced6504ad0 🎨 dinglehopper: Expose clearing the Levenshtein cache as a function 5 years ago
Gerber, Mike 5cf4eddaeb dinglehopper: Clear Levenshtein cache between OCR-D files 5 years ago
Gerber, Mike 58ff140bc0 dinglehopper: Improve performance by caching the Levenshtein matrix
Motivated by [a pull
request](https://github.com/qurator-spk/dinglehopper/pull/7) by
@JKamlah, implement a cache of the Levenshtein matrix calculation.

We calculated the Levenshtein matrices for characters and words twice:
once for the error rates, once for the alignment.
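
A sketch of how such a cache could look, assuming
`functools.lru_cache` and a plain dynamic-programming matrix. This is
not necessarily how dinglehopper implements it, and the function names
`levenshtein_matrix` and `distance` are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=10)
def levenshtein_matrix(seq1, seq2):
    """Compute the full edit-distance matrix for two hashable sequences."""
    m, n = len(seq1), len(seq2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if seq1[i - 1] == seq2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,         # deletion
                D[i][j - 1] + 1,         # insertion
                D[i - 1][j - 1] + cost,  # substitution or match
            )
    # Return an immutable copy so cached results cannot be modified.
    return tuple(tuple(row) for row in D)


def distance(s1, s2):
    # Tuples are hashable, which lru_cache requires for its arguments.
    return levenshtein_matrix(tuple(s1), tuple(s2))[-1][-1]


print(distance("kitten", "sitting"))
```

A second caller that needs the same matrix, for example for the
alignment, then gets the cached result instead of recomputing it.
`levenshtein_matrix.cache_clear()` empties the cache again.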
5 years ago
Gerber, Mike 11a6341641 🧹 dinglehopper: Remove broken implementation of the unordered word error rate 5 years ago
Gerber, Mike f22228840e 🧹 dinglehopper: Use exclusively relative imports in tests 5 years ago
Gerber, Mike d61c076aad 🧹 dinglehopper: Remove debug print()s 5 years ago
Gerber, Mike 12a48f3bfe dinglehopper: Test aligning lists of lines 5 years ago
Gerber, Mike 680c2a2661 🐛 dinglehopper: Fix test_ocrd_cli for Python 3.5, again, and again 5 years ago
Gerber, Mike 7cf1a540f4 🐛 dinglehopper: Fix test_ocrd_cli for Python 3.5, again 5 years ago
Gerber, Mike 49e2065ad6 🐛 dinglehopper: Fix test_ocrd_cli for Python 3.5 5 years ago
Gerber, Mike 86178271df dinglehopper: Fix repeated tests for the OCR-D interface 5 years ago
Gerber, Mike b6f50ef853 dinglehopper: Add a test for the OCR-D interface 5 years ago
Konstantin Baierer 2ca44af31d ocrd-tool: add category 5 years ago
Gerber, Mike c30553985f dinglehopper: Substitute more characters 5 years ago
Gerber, Mike 493541fddf 🐛 dinglehopper: Always work with NFC text 5 years ago
Gerber, Mike df93c80e5d 🐛 dinglehopper: Always work with NFC text 5 years ago
Gerber, Mike 715b813bbc dinglehopper: Add two more eMOP ligatures 5 years ago
Gerber, Mike 8d055e7b6e 🐛 dinglehopper: Work on NFC'ed grapheme clusters when aligning text 5 years ago
Gerber, Mike 534958be1d 🐛 dinglehopper: Fix sorting the reading order
Regions were sorted wrongly when there are more than 9 regions in an
OrderedGroup because the index was sorted alphabetically, not
numerically. Fix this by converting the index to integers.
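
The difference can be reproduced with a few lines of Python. This is
an illustrative sketch with made-up index values, not the code from
this commit:

```python
# Index values as they would be read from XML attributes: strings.
indices = [str(i) for i in range(12)]

alphabetical = sorted(indices)        # string comparison
numerical = sorted(indices, key=int)  # compare as integers

print(alphabetical)
print(numerical)
```

With string comparison, "10" and "11" sort before "2"; converting the
index to an integer restores the intended order.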
5 years ago
Gerber, Mike 10f010eaa8 🐛 dinglehopper: Do not throw error if a region ID is not found
The ReadingOrder might contain regions of types other than text regions,
so not finding a TextRegion with the referenced ID is not an error.
Downgrade to a warning for now.
5 years ago
Gerber, Mike 8237b3edaf dinglehopper: Substitute more characters 5 years ago
Gerber, Mike 02a0e093bf dinglehopper: Add OCR-D interface 5 years ago
Gerber, Mike 495919c06d 🧹 dinglehopper: Move pytest.ini 5 years ago
Gerber, Mike 89048bf55d ➡ Move dinglehopper into its own directory 5 years ago