121 Commits (edc24cd4db36f9c8c600b694fe6c80a33ff0cd65)

Author SHA1 Message Date
Gerber, Mike d726396002 👷🏾‍♂️ Remove str() on Path objects
As of Python 3.6 we don't need to call str() on Path objects anymore.

See also gh-20.
3 years ago
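
A minimal sketch of what the commit above relies on: since Python 3.6, `open()` and the `os.path` functions accept `pathlib.Path` objects directly. The file names are hypothetical and are not taken from the dinglehopper code base.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.txt"  # hypothetical file name

    # open() accepts the Path object itself, no str() needed
    with open(report, "w", encoding="utf-8") as f:
        f.write("ok")

    # os.path functions accept Path objects as well
    exists = os.path.exists(report)
    joined = os.path.join(report.parent, "other.txt")

    content = report.read_text(encoding="utf-8")
    as_fs_path = os.fspath(report)  # the explicit protocol behind this
```

Code that passes paths to an API which does not support the path protocol may still need an explicit `str()`.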
Gerber, Mike 8a3f5e48c2 🐛 dinglehopper: Patch word_break only once
Previously, we (accidentally) patched uniseg's word_break on every call
to words(). Do it only once.
3 years ago
Gerber, Mike f77ce857b2 🚧 dinglehopper: Share json_float code
3 years ago
Gerber, Mike 5b394649a7 🚧 dinglehopper: Compute WER in line-dirs CLI 3 years ago
Gerber, Mike cb2be96179 🚧 dinglehopper: Add word differences in line-dirs report 3 years ago
Gerber, Mike dbb660615a 🚧 dinglehopper: Compare line text directories (WIP)
3 years ago
Gerber, Mike a018006f98 🚧 dinglehopper: Compare line text directories (WIP) 3 years ago
Gerber, Mike 36b36f6986 🚧 dinglehopper: Compare line text directories (WIP) 3 years ago
Gerber, Mike 06ea38449c 📝 dinglehopper: Update Levenshtein notebook 3 years ago
Gerber, Mike 3ee688001a 🧹 dinglehopper: Directly import levenshtein() from rapidfuzz 3 years ago
Gerber, Mike 5d496df267 dinglehopper: Remove tests that only test rapidfuzz's levenshtein() 3 years ago
Gerber, Mike 091f069b3c dinglehopper: Remove tests that only test rapidfuzz's levenshtein_ops() 3 years ago
Gerber, Mike af8da1d716 dinglehopper: Use rapidfuzz for editops 3 years ago
Gerber, Mike 249787686f Merge branch 'master' of github.com:qurator-spk/dinglehopper
4 years ago
Gerber, Mike 2a6cc5823e 🐛 dinglehopper: Call initLogging before logging
When using ocrd_utils' getLogger(), we need to call initLogging() before doing any
logging.

Fixes #55.
4 years ago
Konstantin Baierer 7fde00d911 ReadingOrder may also contain UnorderedGroupIndexed 4 years ago
Gerber, Mike 1778b36a9a 🚧 dinglehopper: Read PAGE UnorderedGroup in XML order 4 years ago
Benjamin Rosemann a68fc269d9 Fix the extraction of text from Page with TableRegion
Dinglehopper did not consider `OrderedGroupIndex` in the `ReadingOrder`
element when extracting text regions. As a consequence a `TableRegion`
was not considered for text extraction.
4 years ago
Konstantin Baierer 74e0ac18ed ocrd cli: use core-provided zip_input_files method 4 years ago
Gerber, Mike 389e253c11 🐛 dinglehopper: Fix alto_extract_lines()'s type annotation 4 years ago
Gerber, Mike fe3923a8af 🐛 dinglehopper: Fix alto_extract()'s type annotation 4 years ago
Gerber, Mike 132f91d500 ✔️ dinglehopper: Add missing integration test markers 4 years ago
Benjamin Rosemann ce752e1912 Remove .idea folder and modify .gitignore
Sharing even parts of the .idea folder in a worldwide setting is bound to
generate more problems than solutions. Therefore it should be removed
and consequently ignored in .gitignore.

Also adds some Python-specific entries to the .gitignore file.
4 years ago
Benjamin Rosemann 5270737c1f Skip test on Windows because it is Unix-specific. 4 years ago
Gerber, Mike 32a4b95a99 🐛 dinglehopper: Normalize in plain_extract() 4 years ago
Gerber, Mike 14421c8e53 🎨 dinglehopper: Reformat using black 4 years ago
Gerber, Mike 31c63f9e4c 🎨 dinglehopper: s/LOG/log 4 years ago
Robert Sachunsky a60c14351e 1 more update for core's getLogger context 4 years ago
Benjamin Rosemann c02569b41e Fix f-strings for Python 3.5 4 years ago
Benjamin Rosemann 7b27b2834e More complex sorting for text extraction
When extracting text from TextEquiv nodes we may encounter nodes without
index or nodes that should get sorted via the conf attribute.

Therefore we added a more complex algorithm to extract a TextEquiv and
inform the user via log messages if we encounter structures that we can
handle but may produce unexpected results.
4 years ago
Benjamin Rosemann 6ff831dfd2 Sort textlines with missing indices
Python's `sorted` function will fail with a TypeError when called with a
list that mixes `None` and integers:

```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we are using `float('inf')` instead of `None` in case of
missing textline indices.
4 years ago
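
The workaround described in the commit above can be sketched as a sort key that maps a missing index to `float('inf')`, so that unindexed items sort last. The `(text, index)` tuples are a hypothetical stand-in for the real textline objects.

```python
# Hypothetical (text, index) pairs; None marks a missing index
lines = [("c", None), ("a", 0), ("b", 1)]


def index_key(item):
    """Sort key: items without an index compare as infinitely large."""
    _, index = item
    return float("inf") if index is None else index


ordered = sorted(lines, key=index_key)
```

Because `sorted` is stable, several items without an index keep their original relative order at the end of the result.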
Gerber, Mike 5cbe148741 🐛 dinglehopper: Skip pages if there is no GT nor OCR (Fixes GH-34) 4 years ago
Gerber, Mike e4e2777cb7 🐛 dinglehopper: Do try to get text when no TextEquivs exist 4 years ago
Gerber, Mike 1c88891a98 ✔️ Add test data for LAREX's indexed TextEquivs (unused) 4 years ago
Gerber, Mike 19d15e3ecc 🐛 dinglehopper: Honor TextEquiv index (Closes GH-33) 4 years ago
Gerber, Mike f626a2ebe6 🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder 4 years ago
Gerber, Mike 8b4ee20a40 Add a new CLI tool dinglehopper-extract to just give the extracted text 4 years ago
Gerber, Mike b23b75b601 dinglehopper: Give segment ids from the extracted textequiv_level 4 years ago
Gerber, Mike b23e4ce30e dinglehopper: Add OCR-D parameter to choose TextEquiv level 4 years ago
Gerber, Mike 9744fa2567 dinglehopper: Add CLI option to choose TextEquiv level 4 years ago
Gerber, Mike 75733039b8 🧹 dinglehopper: Do not hardcode joiner to \n 4 years ago
Gerber, Mike 3848412349 dinglehopper: Implement the basic text extraction from PAGE TextLines 4 years ago
Gerber, Mike f2367ac0c3 🐛 Fix OCR-D CLI for newest OCR-D
Now that find_files() is a generator, we can't use [0] to get the file.
4 years ago
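
A generic illustration of the problem named in the commit above: a generator cannot be indexed, so the first item has to be fetched with `next()`. The `find_files()` defined here is a hypothetical stand-in with made-up file names, not the OCR-D workspace method.

```python
def find_files():
    """Hypothetical stand-in that yields file names lazily."""
    yield from ["page_0001.xml", "page_0002.xml"]


# Indexing a generator raises TypeError
indexing_failed = False
try:
    find_files()[0]
except TypeError:
    indexing_failed = True

# next() returns the first item; the default avoids StopIteration
# when nothing is found
first = next(find_files(), None)
```

Wrapping the call in `list(...)` would also restore indexing, at the cost of consuming the whole generator.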
Gerber, Mike 5ed184c8c4 dinglehopper: Show a progressbar on --progress 4 years ago
Gerber, Mike 4951823a29 🧹 dinglehopper: Disable metrics in JSON report, too 4 years ago
Gerber, Mike 82217a25bb 🧹 dinglehopper: Move all normalization code to extracted_text.py 4 years ago
Gerber, Mike c6c6b8efab 📝 dinglehopper: Add detail about the text extraction and ExtractedText 4 years ago
Gerber, Mike f50591abac Merge branch 'feat/display-segment-id' 4 years ago
Gerber, Mike c514abfb9f 🧹 dinglehopper: Sanitize imports 4 years ago
Gerber, Mike 1077dc64ce ➡️ dinglehopper: Move ExtractedText to its own file 4 years ago