Max Bachmann
a1f0a5e2d3
replace uniseg with uniseg2
2 years ago
Max Bachmann
22c3817f45
apply black
2 years ago
Max Bachmann
01571f23b7
move grapheme clusters to ExtractedText
2 years ago
Max Bachmann
f211d09f56
remove python2.7 futures
2 years ago
Max Bachmann
205a969c0e
remove unused includes
2 years ago
Max Bachmann
f3825cdeb6
only call `words_normalized` once
2 years ago
Gerber, Mike
dcc10c5389
✔️ Skip test_lines_similar() for now
test_lines_similar() fails with rapidfuzz 2.5 and is flawed anyway:
The test was based on our own implementation that used __eq__ and not __hash__ as
rapidfuzz does. Need to review this in the future.
2 years ago
Gerber, Mike
555f586775
📝 Note that old terminals might not render the Unicode characters correctly
2 years ago
Gerber, Mike
c4e85da5ab
🐛 Update editops() and seq_align() due to RapidFuzz API changes
2 years ago
Gerber, Mike
15dfbac3a7
Revert "Revert "Merge pull request #67 from maxbachmann/rapidfuzz""
This reverts commit 76bd50f1db.
2 years ago
Gerber, Mike
76bd50f1db
Revert "Merge pull request #67 from maxbachmann/rapidfuzz"
This reverts commit 85f751aacc, reversing changes made to 1febea8c92.
2 years ago
Max Bachmann
e543438496
replace usage of deprecated rapidfuzz APIs
2 years ago
Gerber, Mike
d726396002
👷🏾♂️ Remove str() on Path objects
As of Python 3.6 we don't need to call str() on Path objects anymore.
See also gh-20.
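A minimal sketch of what this change relies on, assuming only the standard library: since Python 3.6, `open()` and the `os.path` functions accept path-like objects directly. The file name below is made up for illustration and is not taken from the repository.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, used only for this illustration.
    path = Path(tmp) / "example.txt"

    # No str(path) needed: open() accepts os.PathLike objects.
    with open(path, "w", encoding="utf-8") as f:
        f.write("hello")

    # os.path functions accept Path objects as well.
    exists = os.path.exists(path)
    content = path.read_text(encoding="utf-8")
```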
3 years ago
Gerber, Mike
8a3f5e48c2
🐛 dinglehopper: Patch word_break only once
Previously, we (accidentally) patched uniseg's word_break on every call
to words(). Do it only once.
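The patch-once idea can be sketched as follows. This is not the project's actual code: `segmenter`, `patch_word_break()`, `classify()` and the Private Use Area rule are hypothetical stand-ins chosen so the example runs without uniseg installed.

```python
import types

# Hypothetical stand-in for a third-party module that exposes word_break().
segmenter = types.SimpleNamespace(word_break=lambda ch: "Other")

_patched = False  # module-level guard so the patch is applied a single time


def patch_word_break():
    """Wrap segmenter.word_break exactly once, however often this is called."""
    global _patched
    if _patched:
        return
    original = segmenter.word_break

    def word_break(ch):
        # Illustrative rule only: treat Private Use Area code points as letters.
        if 0xE000 <= ord(ch) <= 0xF8FF:
            return "ALetter"
        return original(ch)

    segmenter.word_break = word_break
    _patched = True


def classify(text):
    patch_word_break()  # becomes a cheap no-op after the first call
    return [segmenter.word_break(ch) for ch in text]


first = classify("a\ue000")
patched_fn = segmenter.word_break
second = classify("a\ue000")
```

Without the guard, every call would wrap the previous wrapper again, so the chain of nested functions would grow with each call.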
3 years ago
Gerber, Mike
f77ce857b2
🚧 dinglehopper: Share json_float code
3 years ago
Gerber, Mike
5b394649a7
🚧 dinglehopper: Compute WER in line-dirs CLI
3 years ago
Gerber, Mike
cb2be96179
🚧 dinglehopper: Add word differences in line-dirs report
3 years ago
Gerber, Mike
dbb660615a
🚧 dinglehopper: Compare line text directories (WIP)
3 years ago
Gerber, Mike
a018006f98
🚧 dinglehopper: Compare line text directories (WIP)
3 years ago
Gerber, Mike
36b36f6986
🚧 dinglehopper: Compare line text directories (WIP)
3 years ago
Gerber, Mike
06ea38449c
📝 dinglehopper: Update Levenshtein notebook
3 years ago
Gerber, Mike
3ee688001a
🧹 dinglehopper: Directly import levenshtein() from rapidfuzz
3 years ago
Gerber, Mike
5d496df267
⚡ dinglehopper: Remove tests that only test rapidfuzz's levenshtein()
3 years ago
Gerber, Mike
091f069b3c
⚡ dinglehopper: Remove tests that only test rapidfuzz's levenshtein_ops()
3 years ago
Gerber, Mike
af8da1d716
⚡ dinglehopper: Use rapidfuzz for editops
3 years ago
Gerber, Mike
249787686f
Merge branch 'master' of github.com:qurator-spk/dinglehopper
4 years ago
Gerber, Mike
2a6cc5823e
🐛 dinglehopper: Call initLogging before logging
When using ocrd_utils' getLogger(), we need to call initLogging() before doing any
logging.
Fixes #55.
4 years ago
Konstantin Baierer
7fde00d911
ReadingOrder may also contain UnorderedGroupIndexed
4 years ago
Gerber, Mike
1778b36a9a
🚧 dinglehopper: Read PAGE UnorderedGroup in XML order
4 years ago
Benjamin Rosemann
a68fc269d9
Fix the extraction of text from Page with TableRegion
Dinglehopper did not consider `OrderedGroupIndexed` in the `ReadingOrder`
element when extracting text regions. As a consequence, a `TableRegion`
was not considered for text extraction.
4 years ago
Konstantin Baierer
74e0ac18ed
ocrd cli: use core-provided zip_input_files method
4 years ago
Gerber, Mike
389e253c11
🐛 dinglehopper: Fix alto_extract_lines()'s type annotation
4 years ago
Gerber, Mike
fe3923a8af
🐛 dinglehopper: Fix alto_extract()'s type annotation
4 years ago
Gerber, Mike
132f91d500
✔️ dinglehopper: Add missing integration test markers
4 years ago
Benjamin Rosemann
ce752e1912
Remove .idea folder and modify .gitignore
Sharing even parts of the .idea folder in a worldwide setting is bound to
generate more problems than solutions. Therefore it should be removed
and consequently ignored in .gitignore.
Also adds some Python-specific entries to the .gitignore file.
4 years ago
Benjamin Rosemann
5270737c1f
Skip test on Windows because it is Unix-specific.
4 years ago
Gerber, Mike
32a4b95a99
🐛 dinglehopper: Normalize in plain_extract()
4 years ago
Gerber, Mike
14421c8e53
🎨 dinglehopper: Reformat using black
4 years ago
Gerber, Mike
31c63f9e4c
🎨 dinglehopper: s/LOG/log
4 years ago
Robert Sachunsky
a60c14351e
1 more update for core's getLogger context
4 years ago
Benjamin Rosemann
c02569b41e
Fix f-strings for Python 3.5
4 years ago
Benjamin Rosemann
7b27b2834e
More complex sorting for text extraction
When extracting text from TextEquiv nodes, we may encounter nodes without
an index or nodes that should be sorted via the conf attribute.
Therefore we added a more complex algorithm to extract a TextEquiv and
inform the user via log messages if we encounter structures that we can
handle but that may produce unexpected results.
4 years ago
Benjamin Rosemann
6ff831dfd2
Sort textlines with missing indices
Python's `sorted` function will fail with a TypeError when called with
a mix of `None` and integers:
```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we are using `float('inf')` instead of `None` in case of
missing textline indices.
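A minimal sketch of that approach, not the code from this commit: the helper `sort_key()` and the sample indices are made up for illustration.

```python
def sort_key(index):
    """Map a missing index (None) to +infinity so that it sorts last."""
    return float("inf") if index is None else index


# Hypothetical textline indices, some of them missing.
indices = [2, None, 0, 1, None]

# Comparing ints with float('inf') is well defined, so no TypeError is raised.
ordered = sorted(indices, key=sort_key)
print(ordered)  # [0, 1, 2, None, None]
```

Because `sorted` is stable, entries without an index keep their original relative order at the end of the result.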
4 years ago
Gerber, Mike
5cbe148741
🐛 dinglehopper: Skip pages if there is neither GT nor OCR (Fixes GH-34)
4 years ago
Gerber, Mike
e4e2777cb7
🐛 dinglehopper: Do try to get text when no TextEquivs exist
4 years ago
Gerber, Mike
1c88891a98
✔️ Add test data for LAREX's indexed TextEquivs (unused)
4 years ago
Gerber, Mike
19d15e3ecc
🐛 dinglehopper: Honor TextEquiv index (Closes GH-33)
4 years ago
Gerber, Mike
f626a2ebe6
🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder
4 years ago
Gerber, Mike
8b4ee20a40
✨ Add a new CLI tool dinglehopper-extract to just give the extracted text
4 years ago
Gerber, Mike
b23b75b601
✨ dinglehopper: Give segment ids from the extracted textequiv_level
4 years ago