Commit Graph

292 Commits (35be58cb9456b0893bc46640b234912148621fb6)
 

Author SHA1 Message Date
Gerber, Mike af8da1d716 dinglehopper: Use rapidfuzz for editops 3 years ago
Gerber, Mike 249787686f Merge branch 'master' of github.com:qurator-spk/dinglehopper
4 years ago
Gerber, Mike 2a6cc5823e 🐛 dinglehopper: Call initLogging before logging
When using ocrd_utils' getLogger(), we need to call initLogging() before doing any
logging.

Fixes #55.
4 years ago
Mike Gerber 0b9af3a21e
Merge pull request #58 from kba/unorderedgroupindexed
ReadingOrder may also contain UnorderedGroupIndexed
4 years ago
Konstantin Baierer 7fde00d911 ReadingOrder may also contain UnorderedGroupIndexed 4 years ago
Gerber, Mike 1778b36a9a 🚧 dinglehopper: Read PAGE UnorderedGroup in XML order 4 years ago
Gerber, Mike bd324331e6 🚧 dinglehopper: Try out Drone CI
4 years ago
Gerber, Mike a59ecb795c 🚧 dinglehopper: Try out Drone CI
4 years ago
Gerber, Mike 14230e073a 🚧 dinglehopper: Try out Drone CI 4 years ago
Gerber, Mike 985666a71c 🚧 dinglehopper: Try out Drone CI 4 years ago
Gerber, Mike 4a73053cfc 🚧 Replace Travis with CircleCI 4 years ago
Gerber, Mike e3d4493c82 🚧 Replace Travis with CircleCI 4 years ago
Gerber, Mike 27f4c3bdf8 🚧 Replace Travis with CircleCI 4 years ago
Gerber, Mike 8533e6d421 🚧 Replace Travis with CircleCI 4 years ago
Gerber, Mike e8da8b63f8 🚧 Replace Travis with CircleCI 4 years ago
Gerber, Mike 3b7a1a5631 🚧 Replace Travis with CircleCI 4 years ago
Mike Gerber 691ce371ca
Merge pull request #50 from b2m/fix-table-extraction
Fix the extraction of text from Page with TableRegion
4 years ago
Benjamin Rosemann a68fc269d9 Fix the extraction of text from Page with TableRegion
Dinglehopper did not consider `OrderedGroupIndex` in the `ReadingOrder`
element when extracting text regions. As a consequence a `TableRegion`
was not considered for text extraction.
4 years ago
Gerber, Mike 8cd8314c8a 🐛 dinglehopper: Bump up ocrd req for zip_input_files
See also GH-49.
4 years ago
Mike Gerber 62670dd0c7
Merge pull request #49 from kba/zip_input_files
ocrd cli: use core-provided zip_input_files method
4 years ago
Konstantin Baierer 74e0ac18ed ocrd cli: use core-provided zip_input_files method 4 years ago
Gerber, Mike 389e253c11 🐛 dinglehopper: Fix alto_extract_lines()'s type annotation 4 years ago
Gerber, Mike fe3923a8af 🐛 dinglehopper: Fix alto_extract()'s type annotation 4 years ago
Gerber, Mike 132f91d500 ✔️ dinglehopper: Add missing integration test markers 4 years ago
Gerber, Mike c48d7646df 📝 dinglehopper: README-DEV: Massage markdown a bit 4 years ago
Mike Gerber fed021090d
Merge pull request #46 from b2m/tool-changes
Tool changes
4 years ago
Benjamin Rosemann cb1ac9d260 Add black to developer requirements. 4 years ago
Benjamin Rosemann 03ad413f4a Added some helpful tools and configurations 4 years ago
Benjamin Rosemann 5cbd4f3d95 Preparation for black code formatter 4 years ago
Benjamin Rosemann ce752e1912 Remove .idea folder and modify .gitignore
Sharing even parts of the .idea folder in a worldwide setting is bound to
generate more problems than solutions. Therefore it should be removed
and consequently ignored in .gitignore.

Also adds some Python specific stuff to the .gitignore file.
4 years ago
Benjamin Rosemann 5270737c1f Skip test on windows because it is unix specific. 4 years ago
Gerber, Mike 32a4b95a99 🐛 dinglehopper: Normalize in plain_extract() 4 years ago
Gerber, Mike 14421c8e53 🎨 dinglehopper: Reformat using black 4 years ago
Gerber, Mike 31c63f9e4c 🎨 dinglehopper: s/LOG/log 4 years ago
Mike Gerber 0804b029c4
Merge pull request #43 from bertsky/patch-1
1 more update for core's getLogger context
4 years ago
Robert Sachunsky a60c14351e
1 more update for core's getLogger context 4 years ago
Mike Gerber a51f0b3dcd
Merge pull request #42 from b2m/test-python-cache-for-travis
Add travis pip caching
4 years ago
Benjamin Rosemann b10af9f138 Test travis pip caching 4 years ago
Mike Gerber 089f6d299e
Merge pull request #37 from b2m/fix-sort-with-none
Sort textlines with missing indices
4 years ago
Mike Gerber 5138a1de21
Merge pull request #39 from b2m/test-python-3.9
Add Python 3.9 to .travis.yml
4 years ago
Benjamin Rosemann c02569b41e Fix f-strings for Python 3.5 4 years ago
Benjamin Rosemann 7b27b2834e More complex sorting for text extraction
When extracting text from TextEquiv nodes we may encounter nodes without
index or nodes that should get sorted via the conf attribute.

Therefore we added a more complex algorithm to extract a TextEquiv and
inform the user via log messages if we encounter structures that we can
handle but may produce unexpected results.
4 years ago
Benjamin Rosemann 6ff831dfd2 Sort textlines with missing indices
Python's `sorted` function will fail with a TypeError when called with
a list that mixes `None` and integers:

```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we are using `float('inf')` instead of `None` in case of
missing textline indices.
4 years ago
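
The workaround described in commit 6ff831dfd2 can be sketched as follows. This is a minimal illustration, not the code from the commit: the helper `index_sort_key` and the `(index, text)` pairs are hypothetical names chosen for the example. It only shows how replacing `None` with `float('inf')` in the sort key avoids the TypeError and places unindexed lines last.

```python
def index_sort_key(index):
    """Map a missing index (None) to +infinity so it sorts after all integers."""
    return float("inf") if index is None else index


# Hypothetical (index, text) pairs; None marks a missing textline index.
lines = [(2, "second"), (None, "unindexed"), (1, "first")]

# Sorting on the raw indices would raise TypeError because None and int
# cannot be compared; sorting on the mapped key works.
ordered = sorted(lines, key=lambda item: index_sort_key(item[0]))

print([text for _, text in ordered])  # ['first', 'second', 'unindexed']
```

Because `sorted` is stable, several lines with missing indices would keep their original relative order at the end of the result.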
Benjamin Rosemann e77f19fefc Add Python 3.9 to .travis.yml 4 years ago
Mike Gerber 082fc9e09a
Merge pull request #38 from b2m/add-editorconfig
Add .editorconfig
4 years ago
Benjamin Rosemann 20661487d6 Add .editorconfig
Add a proposal for a .editorconfig file (see https://editorconfig.org/).
This is natively supported by a lot of editors, others are supported via
plugins.

This will close #19.
4 years ago
Gerber, Mike 6e47acda1c 📝 dinglehopper: Move screenshot higher 4 years ago
Gerber, Mike 5cbe148741 🐛 dinglehopper: Skip pages if there is no GT nor OCR (Fixes GH-34) 4 years ago
Gerber, Mike e4e2777cb7 🐛 dinglehopper: Do try to get text when no TextEquivs exist 4 years ago
Gerber, Mike f14ae46870 Merge branch 'feat/text-extraction-levels' 4 years ago