Mike Gerber
691ce371ca
Merge pull request #50 from b2m/fix-table-extraction
Fix the extraction of text from Page with TableRegion
4 years ago
Benjamin Rosemann
a68fc269d9
Fix the extraction of text from Page with TableRegion
Dinglehopper did not consider `OrderedGroupIndexed` in the `ReadingOrder`
element when extracting text regions. As a consequence, a `TableRegion`
was not considered for text extraction.
4 years ago
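The fix described in this commit concerns nested ordered groups in the reading order (the `OrderedGroupIndexed` element in PAGE XML). A minimal sketch of the idea follows; it is not dinglehopper's actual implementation. The sample document, the region ids, and the helper `region_refs` are hypothetical, and the PAGE namespace is left out to keep the example short.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free excerpt of a PAGE ReadingOrder.
# The nested OrderedGroupIndexed stands for a table with two cells.
READING_ORDER = """
<ReadingOrder>
  <OrderedGroup id="ro">
    <RegionRefIndexed index="0" regionRef="r1"/>
    <OrderedGroupIndexed index="1" regionRef="table1">
      <RegionRefIndexed index="0" regionRef="cell1"/>
      <RegionRefIndexed index="1" regionRef="cell2"/>
    </OrderedGroupIndexed>
    <RegionRefIndexed index="2" regionRef="r2"/>
  </OrderedGroup>
</ReadingOrder>
""".strip()


def region_refs(group):
    """Collect regionRef ids in index order, descending into nested groups."""
    refs = []
    children = sorted(group, key=lambda child: int(child.get("index", 0)))
    for child in children:
        if child.tag == "RegionRefIndexed":
            refs.append(child.get("regionRef"))
        elif child.tag == "OrderedGroupIndexed":
            # Without this branch, regions inside the nested group are skipped.
            refs.extend(region_refs(child))
    return refs


root = ET.fromstring(READING_ORDER)
order = region_refs(root.find("OrderedGroup"))
```

In this sketch the regions referenced inside the nested group are visited between `r1` and `r2`; dropping the `OrderedGroupIndexed` branch would leave them out of the result.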
Gerber, Mike
8cd8314c8a
🐛 dinglehopper: Bump up ocrd req for zip_input_files
See also GH-49.
4 years ago
Mike Gerber
62670dd0c7
Merge pull request #49 from kba/zip_input_files
ocrd cli: use core-provided zip_input_files method
4 years ago
Konstantin Baierer
74e0ac18ed
ocrd cli: use core-provided zip_input_files method
4 years ago
Gerber, Mike
389e253c11
🐛 dinglehopper: Fix alto_extract_lines()'s type annotation
4 years ago
Gerber, Mike
fe3923a8af
🐛 dinglehopper: Fix alto_extract()'s type annotation
4 years ago
Gerber, Mike
132f91d500
✔️ dinglehopper: Add missing integration test markers
4 years ago
Gerber, Mike
c48d7646df
📝 dinglehopper: README-DEV: Massage markdown a bit
4 years ago
Mike Gerber
fed021090d
Merge pull request #46 from b2m/tool-changes
Tool changes
4 years ago
Benjamin Rosemann
cb1ac9d260
Add black to developer requirements.
4 years ago
Benjamin Rosemann
03ad413f4a
Added some helpful tools and configurations
4 years ago
Benjamin Rosemann
5cbd4f3d95
Preparation for black code formatter
4 years ago
Benjamin Rosemann
ce752e1912
Remove .idea folder and modify .gitignore
Sharing even parts of the .idea folder in a worldwide setting is bound to
generate more problems than solutions. Therefore it should be removed
and consequently ignored in .gitignore.
Also adds some Python-specific entries to the .gitignore file.
4 years ago
Benjamin Rosemann
5270737c1f
Skip test on Windows because it is Unix-specific.
4 years ago
Gerber, Mike
32a4b95a99
🐛 dinglehopper: Normalize in plain_extract()
4 years ago
Gerber, Mike
14421c8e53
🎨 dinglehopper: Reformat using black
4 years ago
Gerber, Mike
31c63f9e4c
🎨 dinglehopper: s/LOG/log
4 years ago
Mike Gerber
0804b029c4
Merge pull request #43 from bertsky/patch-1
1 more update for core's getLogger context
4 years ago
Robert Sachunsky
a60c14351e
1 more update for core's getLogger context
4 years ago
Mike Gerber
a51f0b3dcd
Merge pull request #42 from b2m/test-python-cache-for-travis
Add travis pip caching
4 years ago
Benjamin Rosemann
b10af9f138
Test travis pip caching
4 years ago
Mike Gerber
089f6d299e
Merge pull request #37 from b2m/fix-sort-with-none
Sort textlines with missing indices
4 years ago
Mike Gerber
5138a1de21
Merge pull request #39 from b2m/test-python-3.9
Add Python 3.9 to .travis.yml
4 years ago
Benjamin Rosemann
c02569b41e
Fix f-strings for Python 3.5
4 years ago
Benjamin Rosemann
7b27b2834e
More complex sorting for text extraction
When extracting text from TextEquiv nodes we may encounter nodes without
an index or nodes that should be sorted via the conf attribute.
Therefore we added a more complex algorithm to select a TextEquiv and
inform the user via log messages when we encounter structures that we can
handle but that may produce unexpected results.
4 years ago
Benjamin Rosemann
6ff831dfd2
Sort textlines with missing indices
Python's `sorted` function fails with a TypeError when called on a list
that mixes `None` and integers:
```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we are using `float('inf')` instead of `None` in case of
missing textline indices.
4 years ago
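The workaround from the "Sort textlines with missing indices" commit can be sketched as follows. This is an illustration under assumptions, not the project's code: the helper `index_sort_key` and the sample `(index, text)` pairs are hypothetical.

```python
def index_sort_key(index):
    """Map a missing index (None) to infinity so it sorts after all numbers."""
    return float("inf") if index is None else index


# Hypothetical (index, text) pairs standing in for text lines.
lines = [(2, "third"), (None, "no index"), (0, "first"), (1, "second")]

# Sorting on the raw index would raise a TypeError because of the None entry;
# the key function makes every sort key a comparable number.
ordered = sorted(lines, key=lambda line: index_sort_key(line[0]))
texts = [text for _, text in ordered]
```

Because `sorted` is stable, several lines with a missing index keep their original relative order at the end of the result.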
Benjamin Rosemann
e77f19fefc
Add Python 3.9 to .travis.yml
4 years ago
Mike Gerber
082fc9e09a
Merge pull request #38 from b2m/add-editorconfig
Add .editorconfig
4 years ago
Benjamin Rosemann
20661487d6
Add .editorconfig
Add a proposal for a .editorconfig file (see https://editorconfig.org/).
This is natively supported by a lot of editors, others are supported via
plugins.
This will close #19.
4 years ago
Gerber, Mike
6e47acda1c
📝 dinglehopper: Move screenshot higher
4 years ago
Gerber, Mike
5cbe148741
🐛 dinglehopper: Skip pages if there is neither GT nor OCR (Fixes GH-34)
4 years ago
Gerber, Mike
e4e2777cb7
🐛 dinglehopper: Do try to get text when no TextEquivs exist
4 years ago
Gerber, Mike
f14ae46870
Merge branch 'feat/text-extraction-levels'
4 years ago
Gerber, Mike
1c88891a98
✔️ Add test data for LAREX's indexed TextEquivs (unused)
4 years ago
Gerber, Mike
19d15e3ecc
🐛 dinglehopper: Honor TextEquiv index (Closes GH-33)
4 years ago
Gerber, Mike
f626a2ebe6
🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder
4 years ago
Gerber, Mike
0f3857d8d3
📝 Document OCR-D parameters and restructure README a bit
4 years ago
Gerber, Mike
8b4ee20a40
✨ Add a new CLI tool dinglehopper-extract to just give the extracted text
4 years ago
Gerber, Mike
b23b75b601
✨ dinglehopper: Give segment ids from the extracted textequiv_level
4 years ago
Gerber, Mike
b23e4ce30e
✨ dinglehopper: Add OCR-D parameter to choose TextEquiv level
4 years ago
Gerber, Mike
9744fa2567
✨ dinglehopper: Add CLI option to choose TextEquiv level
4 years ago
Gerber, Mike
75733039b8
🧹 dinglehopper: Do not hardcode joiner to \n
4 years ago
Gerber, Mike
3848412349
✨ dinglehopper: Implement the basic text extraction from PAGE TextLines
4 years ago
Gerber, Mike
f2367ac0c3
🐛 Fix OCR-D CLI for newest OCR-D
Now that find_files() is a generator, we can't use [0] to get the file.
4 years ago
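The problem named in the "Fix OCR-D CLI for newest OCR-D" commit is a general property of Python generators: they cannot be subscripted. A minimal sketch, using a hypothetical stand-in function and made-up file names rather than the real OCR-D `find_files()` API:

```python
def find_files():
    """Hypothetical stand-in for a lookup that yields its results lazily."""
    yield from ["OCR-D-GT_0001.xml", "OCR-D-GT_0002.xml"]


# Indexing a generator is not supported and raises a TypeError.
try:
    find_files()[0]
except TypeError as error:
    subscript_error = error

# next() takes the first item; the default avoids StopIteration when empty.
first = next(find_files(), None)
```

Wrapping the call in `list(...)` and indexing the result would also work, at the cost of consuming the whole generator.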
Gerber, Mike
5ed184c8c4
✨ dinglehopper: Show a progressbar on --progress
4 years ago
Gerber, Mike
4951823a29
🧹 dinglehopper: Disable metrics in JSON report, too
4 years ago
Gerber, Mike
5303eea80c
📝 dinglehopper: Update README to use OCR-D's new and more readable -P option
4 years ago
Gerber, Mike
82217a25bb
🧹 dinglehopper: Move all normalization code to extracted_text.py
4 years ago
Gerber, Mike
009fa55c09
Merge branch 'master' of https://github.com/qurator-spk/dinglehopper
4 years ago