Benjamin Rosemann
7b27b2834e
More complex sorting for text extraction
When extracting text from TextEquiv nodes we may encounter nodes without
an index or nodes that should be sorted via the conf attribute.
Therefore we added a more complex algorithm to extract a TextEquiv and
inform the user via log messages when we encounter structures that we can
handle but that may produce unexpected results.
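The selection logic described above could look roughly like the following sketch. It is not the code from this commit: the `TextEquiv` dataclass, the `pick_text_equiv` helper, and the exact fallback order (lowest index first, otherwise highest conf) are assumptions made for illustration only.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional

log = logging.getLogger("textequiv_sketch")


@dataclass
class TextEquiv:
    # Hypothetical stand-in for a PAGE TextEquiv node, not the real class.
    text: str
    index: Optional[int] = None
    conf: Optional[float] = None


def pick_text_equiv(candidates: List[TextEquiv]) -> Optional[TextEquiv]:
    """Pick one TextEquiv: lowest index if any is indexed, else highest conf."""
    if not candidates:
        return None
    indexed = [c for c in candidates if c.index is not None]
    if indexed:
        if len(indexed) < len(candidates):
            # Mixed structure: we can handle it, but the result may surprise.
            log.warning("Some TextEquiv nodes have no index; ignoring them.")
        return min(indexed, key=lambda c: c.index)
    if len(candidates) > 1:
        log.warning("No TextEquiv has an index; falling back to conf.")
    # Assumption: a missing conf counts as the lowest possible confidence.
    return max(
        candidates,
        key=lambda c: c.conf if c.conf is not None else float("-inf"),
    )
```

Logging a warning instead of raising keeps the extraction running on unusual input while still telling the user that the chosen text may not be the one they expect.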
4 years ago
Benjamin Rosemann
6ff831dfd2
Sort textlines with missing indices
Python's `sorted` function fails with a TypeError when the input mixes
`None` and integers:
```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we use `float('inf')` instead of `None` for missing textline
indices, so that lines without an index sort after all indexed lines.
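A minimal sketch of the idea, assuming a hypothetical `sort_key` helper and made-up index values (neither is taken from the commit itself):

```python
def sort_key(index):
    """Map a missing index (None) to +inf so it sorts after every integer."""
    return float("inf") if index is None else index


# Made-up textline indices; None marks a line without an index.
indices = [2, None, 0, 1]
ordered = sorted(indices, key=sort_key)
print(ordered)  # [0, 1, 2, None]
```

Because `sorted` is stable, several lines without an index keep their original relative order at the end of the result.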
4 years ago
Benjamin Rosemann
e77f19fefc
Add Python 3.9 to .travis.yml
4 years ago
Mike Gerber
082fc9e09a
Merge pull request #38 from b2m/add-editorconfig
Add .editorconfig
4 years ago
Benjamin Rosemann
20661487d6
Add .editorconfig
Add a proposal for a .editorconfig file (see https://editorconfig.org/).
Many editors support this natively; others support it via plugins.
This will close #19.
4 years ago
Gerber, Mike
6e47acda1c
📝 dinglehopper: Move screenshot higher
4 years ago
Gerber, Mike
5cbe148741
🐛 dinglehopper: Skip pages if there is neither GT nor OCR (Fixes GH-34)
4 years ago
Gerber, Mike
e4e2777cb7
🐛 dinglehopper: Do try to get text when no TextEquivs exist
4 years ago
Gerber, Mike
f14ae46870
Merge branch 'feat/text-extraction-levels'
4 years ago
Gerber, Mike
1c88891a98
✔️ Add test data for LAREX's indexed TextEquivs (unused)
4 years ago
Gerber, Mike
19d15e3ecc
🐛 dinglehopper: Honor TextEquiv index (Closes GH-33)
4 years ago
Gerber, Mike
f626a2ebe6
🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder
4 years ago
Gerber, Mike
0f3857d8d3
📝 Document OCR-D parameters and restructure README a bit
4 years ago
Gerber, Mike
8b4ee20a40
✨ Add a new CLI tool dinglehopper-extract to just give the extracted text
4 years ago
Gerber, Mike
b23b75b601
✨ dinglehopper: Give segment ids from the extracted textequiv_level
4 years ago
Gerber, Mike
b23e4ce30e
✨ dinglehopper: Add OCR-D parameter to choose TextEquiv level
4 years ago
Gerber, Mike
9744fa2567
✨ dinglehopper: Add CLI option to choose TextEquiv level
4 years ago
Gerber, Mike
75733039b8
🧹 dinglehopper: Do not hardcode joiner to \n
4 years ago
Gerber, Mike
3848412349
✨ dinglehopper: Implement the basic text extraction from PAGE TextLines
4 years ago
Gerber, Mike
f2367ac0c3
🐛 Fix OCR-D CLI for newest OCR-D
Now that find_files() is a generator, we can't use [0] to get the file.
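The underlying Python behaviour can be reproduced with a small sketch. The `find_files` function below is a hypothetical stand-in that only mimics "returns a generator"; it is not OCR-D's actual API, and the file names are made up.

```python
def find_files():
    """Hypothetical stand-in for a function that now yields its results."""
    yield "OCR-D-GT-PAGE_0001.xml"
    yield "OCR-D-GT-PAGE_0002.xml"


# Generators do not support indexing, so the old [0] access raises TypeError.
try:
    find_files()[0]
    indexing_worked = True
except TypeError:
    indexing_worked = False

# Take the first item with next(); the default avoids StopIteration
# when the generator yields nothing.
first = next(find_files(), None)
print(indexing_worked, first)
```

Wrapping the call in `list(...)` and indexing would also work, at the cost of consuming the whole generator just to read one element.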
4 years ago
Gerber, Mike
5ed184c8c4
✨ dinglehopper: Show a progressbar on --progress
4 years ago
Gerber, Mike
4951823a29
🧹 dinglehopper: Disable metrics in JSON report, too
4 years ago
Gerber, Mike
5303eea80c
📝 dinglehopper: Update README to use OCR-D's new and more readable -P option
4 years ago
Gerber, Mike
82217a25bb
🧹 dinglehopper: Move all normalization code to extracted_text.py
4 years ago
Gerber, Mike
009fa55c09
Merge branch 'master' of https://github.com/qurator-spk/dinglehopper
4 years ago
Gerber, Mike
c20bbbfa25
📝 dinglehopper: Update screenshot to include a region id tooltip
4 years ago
Mike Gerber
252bf9b3e7
📝 dinglehopper: Fix markdown in README.md
4 years ago
Gerber, Mike
c6c6b8efab
📝 dinglehopper: Add detail about the text extraction and ExtractedText
4 years ago
Gerber, Mike
7025ea54a8
📝 dinglehopper: Move developer info to README-DEV.md
4 years ago
Gerber, Mike
f50591abac
Merge branch 'feat/display-segment-id'
4 years ago
Gerber, Mike
c514abfb9f
🧹 dinglehopper: Sanitize imports
4 years ago
Gerber, Mike
1077dc64ce
➡️ dinglehopper: Move ExtractedText to its own file
4 years ago
Gerber, Mike
9dd4ff0aae
✨ dinglehopper: Extract line IDs for ALTO
4 years ago
Gerber, Mike
f3aafb6fdf
✨ dinglehopper: Validate ExtractedText.{segments,_text} in both directions
4 years ago
Gerber, Mike
1f9a680fe7
⚙️ dinglehopper: PyCharm should use dinglehopper-github virtualenv
4 years ago
Gerber, Mike
b14c35e147
🎨 dinglehopper: Use multimethod to handle str vs ExtractedText
4 years ago
Gerber, Mike
a17ee2afec
🚧 dinglehopper: Guarantee NFC + rename from_text → from_str
4 years ago
Gerber, Mike
7843824eaf
🚧 dinglehopper: Support str & ExtractedText in CER and distance functions
4 years ago
Gerber, Mike
5bee55c896
💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments
4 years ago
Gerber, Mike
96b55f1806
🚧 dinglehopper: Hierarchical text representation
4 years ago
Gerber, Mike
db6292611f
🧹 dinglehopper: Remove merged text extraction test code
4 years ago
Gerber, Mike
d706ef4621
📝 Document CER/WER and the format detection (Fixes GH-26)
4 years ago
Gerber, Mike
da47e41c85
💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments
4 years ago
Mike Gerber
7085ee0fd8
Merge pull request #29 from kba/getlogger
getLogger per method
4 years ago
Gerber, Mike
77154ef256
📝 dinglehopper: Document REPORT_PREFIX (Closes GH-27.)
4 years ago
Gerber, Mike
829b84c66a
⚙️ dinglehopper: Add PyCharm's vcs.xml to git
4 years ago
Konstantin Baierer
12da98e477
getLogger per method
4 years ago
Gerber, Mike
717801bdbb
Merge commit '7930ecd42868cb6785a58f8ee95b05882704621d'
4 years ago
Gerber, Mike
7930ecd428
Merge branch 'master' of https://github.com/qurator-spk/dinglehopper
4 years ago
Gerber, Mike
976a042b2b
🔧 dinglehopper: Add PyCharm code style config
4 years ago