Commit Graph

462 Commits (master)
 

Author SHA1 Message Date
Benjamin Rosemann 5270737c1f Skip test on Windows because it is Unix-specific. 4 years ago
Gerber, Mike 32a4b95a99 🐛 dinglehopper: Normalize in plain_extract() 4 years ago
Gerber, Mike 14421c8e53 🎨 dinglehopper: Reformat using black 4 years ago
Gerber, Mike 31c63f9e4c 🎨 dinglehopper: s/LOG/log 4 years ago
Mike Gerber 0804b029c4 Merge pull request #43 from bertsky/patch-1
1 more update for core's getLogger context
4 years ago
Robert Sachunsky a60c14351e 1 more update for core's getLogger context 4 years ago
Mike Gerber a51f0b3dcd Merge pull request #42 from b2m/test-python-cache-for-travis
Add travis pip caching
4 years ago
Benjamin Rosemann b10af9f138 Test travis pip caching 4 years ago
Mike Gerber 089f6d299e Merge pull request #37 from b2m/fix-sort-with-none
Sort textlines with missing indices
4 years ago
Mike Gerber 5138a1de21 Merge pull request #39 from b2m/test-python-3.9
Add Python 3.9 to .travis.yml
4 years ago
Benjamin Rosemann c02569b41e Fix f-strings for Python 3.5 4 years ago
Benjamin Rosemann 7b27b2834e More complex sorting for text extraction
When extracting text from TextEquiv nodes we may encounter nodes without
index or nodes that should get sorted via the conf attribute.

Therefore we added a more complex algorithm to extract a TextEquiv and
inform the user via log messages if we encounter structures that we can
handle but may produce unexpected results.
4 years ago
Benjamin Rosemann 6ff831dfd2 Sort textlines with missing indices
Python's `sorted` function fails with a TypeError when called on a list that
mixes `None` and integers:

```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we are using `float('inf')` instead of `None` in case of
missing textline indices.
4 years ago
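The approach described in commit 6ff831dfd2 can be sketched as follows. This is a minimal illustration, not the code from the commit: the helper name `sort_key` and the sample list are assumptions made for the example. It maps a missing index to `float('inf')` so that entries without an index sort after all indexed ones.

```python
def sort_key(index):
    """Return a key that places a missing (None) index after every integer.

    Hypothetical helper for illustration; not taken from the repository.
    """
    return float("inf") if index is None else index


# Sample indices, one of them missing (assumed data for the example).
indices = [2, None, 0, 1]

# Sorting on the key avoids comparing None with an int directly.
sorted_indices = sorted(indices, key=sort_key)
print(sorted_indices)  # [0, 1, 2, None]
```

Because `sorted` is stable, several entries with a missing index keep their original relative order at the end of the result.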
Benjamin Rosemann e77f19fefc Add Python 3.9 to .travis.yml 4 years ago
Mike Gerber 082fc9e09a Merge pull request #38 from b2m/add-editorconfig
Add .editorconfig
4 years ago
Benjamin Rosemann 20661487d6 Add .editorconfig
Add a proposal for a .editorconfig file (see https://editorconfig.org/).
This is natively supported by a lot of editors, others are supported via
plugins.

This will close #19.
4 years ago
Gerber, Mike 6e47acda1c 📝 dinglehopper: Move screenshot higher 4 years ago
Gerber, Mike 5cbe148741 🐛 dinglehopper: Skip pages if there is no GT nor OCR (Fixes GH-34) 4 years ago
Gerber, Mike e4e2777cb7 🐛 dinglehopper: Do try to get text when no TextEquivs exist 4 years ago
Gerber, Mike f14ae46870 Merge branch 'feat/text-extraction-levels' 4 years ago
Gerber, Mike 1c88891a98 ✔️ Add test data for LAREX's indexed TextEquivs (unused) 4 years ago
Gerber, Mike 19d15e3ecc 🐛 dinglehopper: Honor TextEquiv index (Closes GH-33) 4 years ago
Gerber, Mike f626a2ebe6 🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder 4 years ago
Gerber, Mike 0f3857d8d3 📝 Document OCR-D parameters and restructure README a bit 4 years ago
Gerber, Mike 8b4ee20a40 Add a new CLI tool dinglehopper-extract to just give the extracted text 4 years ago
Gerber, Mike b23b75b601 dinglehopper: Give segment ids from the extracted textequiv_level 4 years ago
Gerber, Mike b23e4ce30e dinglehopper: Add OCR-D parameter to choose TextEquiv level 4 years ago
Gerber, Mike 9744fa2567 dinglehopper: Add CLI option to choose TextEquiv level 4 years ago
Gerber, Mike 75733039b8 🧹 dinglehopper: Do not hardcode joiner to \n 4 years ago
Gerber, Mike 3848412349 dinglehopper: Implement the basic text extraction from PAGE TextLines 4 years ago
Gerber, Mike f2367ac0c3 🐛 Fix OCR-D CLI for newest OCR-D
Now that find_files() is a generator, we can't use [0] to get the file.
4 years ago
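The problem named in commit f2367ac0c3 is general Python behaviour and can be reproduced without OCR-D. In the sketch below, `find_files` is a hypothetical stand-in generator with made-up file paths, not the OCR-D function, and the two alternatives shown are common ways to get the first item, not necessarily the fix used in the commit.

```python
def find_files():
    """Hypothetical stand-in that yields results lazily, like a generator-based lookup."""
    yield "OCR-D-GT-PAGE/page_0001.xml"  # made-up path for illustration
    yield "OCR-D-GT-PAGE/page_0002.xml"  # made-up path for illustration


# Subscripting a generator raises TypeError, so `find_files()[0]` no longer works.
try:
    find_files()[0]
    subscript_worked = True
except TypeError:
    subscript_worked = False

# Option 1: take the first item lazily; the default None covers an empty result.
first = next(find_files(), None)

# Option 2: materialize the results first, then index as before.
first_via_list = list(find_files())[0]

print(subscript_worked, first, first_via_list)
```

`next()` with a default avoids building the whole list and does not raise when nothing is found, while `list(...)[0]` keeps the old indexing style at the cost of consuming the entire generator.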
Gerber, Mike 5ed184c8c4 dinglehopper: Show a progressbar on --progress 4 years ago
Gerber, Mike 4951823a29 🧹 dinglehopper: Disable metrics in JSON report, too 4 years ago
Gerber, Mike 5303eea80c 📝 dinglehopper: Update README to use OCR-D's new and more readable -P option 4 years ago
Gerber, Mike 82217a25bb 🧹 dinglehopper: Move all normalization code to extracted_text.py 4 years ago
Gerber, Mike 009fa55c09 Merge branch 'master' of https://github.com/qurator-spk/dinglehopper 4 years ago
Gerber, Mike c20bbbfa25 📝 dinglehopper: Update screenshot to include a region id tooltip 4 years ago
Mike Gerber 252bf9b3e7 📝 dinglehopper: Fix markdown in README.md 4 years ago
Gerber, Mike c6c6b8efab 📝 dinglehopper: Add detail about the text extraction and ExtractedText 4 years ago
Gerber, Mike 7025ea54a8 📝 dinglehopper: Move developer info to README-DEV.md 4 years ago
Gerber, Mike f50591abac Merge branch 'feat/display-segment-id' 4 years ago
Gerber, Mike c514abfb9f 🧹 dinglehopper: Sanitize imports 4 years ago
Gerber, Mike 1077dc64ce ➡️ dinglehopper: Move ExtractedText to its own file 4 years ago
Gerber, Mike 9dd4ff0aae dinglehopper: Extract line IDs for ALTO 4 years ago
Gerber, Mike f3aafb6fdf dinglehopper: Validate ExtractedText.{segments,_text} in both directions 4 years ago
Gerber, Mike 1f9a680fe7 ⚙️ dinglehopper: PyCharm should use dinglehopper-github virtualenv 4 years ago
Gerber, Mike b14c35e147 🎨 dinglehopper: Use multimethod to handle str vs ExtractedText 4 years ago
Gerber, Mike a17ee2afec 🚧 dinglehopper: Guarantee NFC + rename from_text → from_str 4 years ago
Gerber, Mike 7843824eaf 🚧 dinglehopper: Support str & ExtractedText in CER and distance functions 4 years ago
Gerber, Mike 5bee55c896 💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments 4 years ago