31c63f9e4c
🎨 dinglehopper: s/LOG/log
2020-11-09 16:55:43 +01:00
0804b029c4
Merge pull request #43 from bertsky/patch-1
...
1 more update for core's getLogger context
2020-11-09 16:51:00 +01:00
Robert Sachunsky
a60c14351e
1 more update for core's getLogger context
2020-11-03 17:46:59 +01:00
a51f0b3dcd
Merge pull request #42 from b2m/test-python-cache-for-travis
...
Add travis pip caching
2020-10-30 12:35:20 +01:00
Benjamin Rosemann
b10af9f138
Test travis pip caching
2020-10-29 16:41:19 +01:00
089f6d299e
Merge pull request #37 from b2m/fix-sort-with-none
...
Sort textlines with missing indices
2020-10-29 15:05:46 +01:00
5138a1de21
Merge pull request #39 from b2m/test-python-3.9
...
Add Python 3.9 to .travis.yml
2020-10-29 13:42:24 +01:00
Benjamin Rosemann
c02569b41e
Fix f-strings for Python 3.5
2020-10-29 12:33:54 +01:00
Benjamin Rosemann
7b27b2834e
More complex sorting for text extraction
...
When extracting text from TextEquiv nodes, we may encounter nodes without
an index or nodes that should be sorted via the conf attribute.
Therefore we added a more complex algorithm to extract a TextEquiv. It
informs the user via log messages when it encounters structures that we
can handle but that may produce unexpected results.
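The selection described above could look roughly like the following sketch. Everything in it is an assumption made for illustration: the `TextEquiv` tuple, the `pick_textequiv` helper, and the precedence (lowest index first, then highest conf, then document order) are hypothetical and not taken from the dinglehopper code.

```python
import logging
from typing import NamedTuple, Optional, Sequence

log = logging.getLogger("textequiv_sketch")  # hypothetical logger name


class TextEquiv(NamedTuple):
    """Hypothetical stand-in for a PAGE TextEquiv node."""
    text: str
    index: Optional[int] = None
    conf: Optional[float] = None


def pick_textequiv(candidates: Sequence[TextEquiv]) -> Optional[TextEquiv]:
    """Pick one TextEquiv; the precedence used here is an assumption."""
    if not candidates:
        return None
    # Prefer nodes that carry an index and take the lowest one.
    indexed = [c for c in candidates if c.index is not None]
    if indexed:
        if len(indexed) < len(candidates):
            log.warning("Some TextEquivs have no index; ignoring them.")
        return min(indexed, key=lambda c: c.index)
    # No index anywhere: fall back to the highest confidence.
    scored = [c for c in candidates if c.conf is not None]
    if scored:
        log.warning("No TextEquiv has an index; using the highest conf.")
        return max(scored, key=lambda c: c.conf)
    # Neither index nor conf: keep document order.
    if len(candidates) > 1:
        log.warning("No index or conf; using the first TextEquiv.")
    return candidates[0]
```

The warnings mirror the commit's intent of telling the user when the input can be handled but the result may be surprising.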
2020-10-29 10:03:40 +01:00
Benjamin Rosemann
6ff831dfd2
Sort textlines with missing indices
...
Python's built-in `sorted` function raises a TypeError when the input
mixes `None` and integers:
```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we are using `float('inf')` instead of `None` in case of
missing textline indices.
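A minimal sketch of this idea, assuming textlines are plain dicts with an optional `index` key (the dict layout and the helper name `sort_key` are hypothetical and only serve the illustration):

```python
def sort_key(line):
    """Return the line's index, or +inf so lines without an index sort last."""
    index = line.get("index")
    return float("inf") if index is None else index


lines = [
    {"id": "l2", "index": None},  # missing index
    {"id": "l1", "index": 1},
    {"id": "l0", "index": 0},
]

# Integers and float('inf') compare without error, unlike int and None.
ordered = sorted(lines, key=sort_key)
print([line["id"] for line in ordered])  # ['l0', 'l1', 'l2']
```

Because `sorted` is stable, several lines without an index keep their original relative order at the end.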
2020-10-29 10:03:40 +01:00
Benjamin Rosemann
e77f19fefc
Add Python 3.9 to .travis.yml
2020-10-29 10:02:51 +01:00
082fc9e09a
Merge pull request #38 from b2m/add-editorconfig
...
Add .editorconfig
2020-10-28 15:16:04 +01:00
Benjamin Rosemann
20661487d6
Add .editorconfig
...
Add a proposal for an .editorconfig file (see https://editorconfig.org/).
Many editors support this natively; others support it via plugins.
This will close #19.
2020-10-28 11:31:18 +01:00
6e47acda1c
📝 dinglehopper: Move screenshot higher
2020-10-21 19:31:53 +02:00
5cbe148741
🐛 dinglehopper: Skip pages if there is neither GT nor OCR (Fixes GH-34)
2020-10-21 19:29:45 +02:00
e4e2777cb7
🐛 dinglehopper: Do try to get text when no TextEquivs exist
2020-10-21 17:59:44 +02:00
f14ae46870
Merge branch 'feat/text-extraction-levels'
2020-10-21 17:51:44 +02:00
1c88891a98
✔️ Add test data for LAREX's indexed TextEquivs (unused)
2020-10-21 17:51:15 +02:00
19d15e3ecc
🐛 dinglehopper: Honor TextEquiv index (Closes GH-33)
2020-10-21 17:50:21 +02:00
f626a2ebe6
🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder
2020-10-21 17:03:55 +02:00
0f3857d8d3
📝 Document OCR-D parameters and restructure README a bit
2020-10-21 16:54:23 +02:00
8b4ee20a40
✨ Add a new CLI tool dinglehopper-extract to just give the extracted text
2020-10-21 16:30:48 +02:00
b23b75b601
✨ dinglehopper: Give segment ids from the extracted textequiv_level
2020-10-21 16:04:33 +02:00
b23e4ce30e
✨ dinglehopper: Add OCR-D parameter to choose TextEquiv level
2020-10-21 14:38:19 +02:00
9744fa2567
✨ dinglehopper: Add CLI option to choose TextEquiv level
2020-10-20 19:33:39 +02:00
75733039b8
🧹 dinglehopper: Do not hardcode joiner to \n
2020-10-20 18:43:56 +02:00
3848412349
✨ dinglehopper: Implement the basic text extraction from PAGE TextLines
2020-10-20 18:40:21 +02:00
f2367ac0c3
🐛 Fix OCR-D CLI for newest OCR-D
...
Now that find_files() is a generator, we can't use [0] to get the file.
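The underlying Python behaviour can be shown with a stand-in generator. The `find_files` function below is a hypothetical placeholder, not OCR-D's actual method, and the two workarounds are common options rather than necessarily the fix used in this commit.

```python
def find_files():
    """Hypothetical stand-in that yields file names lazily."""
    yield "OCR-D-GT-PAGE/page_0001.xml"
    yield "OCR-D-GT-PAGE/page_0002.xml"


# Subscripting a generator raises TypeError.
try:
    find_files()[0]
    subscript_error = None
except TypeError as exc:
    subscript_error = exc

# Option 1: take the first item lazily (None if nothing is yielded).
first = next(find_files(), None)

# Option 2: materialise the generator into a list, then index it.
also_first = list(find_files())[0]
```

`next()` with a default avoids reading the whole sequence and does not raise when no file matches.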
2020-10-16 14:58:27 +02:00
5ed184c8c4
✨ dinglehopper: Show a progressbar on --progress
2020-10-15 16:09:54 +02:00
4951823a29
🧹 dinglehopper: Disable metrics in JSON report, too
2020-10-15 15:38:15 +02:00
5303eea80c
📝 dinglehopper: Update README to use OCR-D's new and more readable -P option
2020-10-15 15:37:51 +02:00
82217a25bb
🧹 dinglehopper: Move all normalization code to extracted_text.py
2020-10-08 17:29:25 +02:00
009fa55c09
Merge branch 'master' of https://github.com/qurator-spk/dinglehopper
2020-10-08 17:17:40 +02:00
c20bbbfa25
📝 dinglehopper: Update screenshot to include a region id tooltip
2020-10-08 17:17:34 +02:00
252bf9b3e7
📝 dinglehopper: Fix markdown in README.md
2020-10-08 17:14:29 +02:00
c6c6b8efab
📝 dinglehopper: Add detail about the text extraction and ExtractedText
2020-10-08 17:05:36 +02:00
7025ea54a8
📝 dinglehopper: Move developer info to README-DEV.md
2020-10-08 16:59:50 +02:00
f50591abac
Merge branch 'feat/display-segment-id'
2020-10-08 13:39:38 +02:00
c514abfb9f
🧹 dinglehopper: Sanitize imports
2020-10-08 13:33:19 +02:00
1077dc64ce
➡️ dinglehopper: Move ExtractedText to its own file
2020-10-08 13:25:20 +02:00
9dd4ff0aae
✨ dinglehopper: Extract line IDs for ALTO
2020-10-08 12:54:28 +02:00
f3aafb6fdf
✨ dinglehopper: Validate ExtractedText.{segments,_text} in both directions
2020-10-08 12:20:27 +02:00
1f9a680fe7
⚙️ dinglehopper: PyCharm should use dinglehopper-github virtualenv
2020-10-08 12:16:42 +02:00
b14c35e147
🎨 dinglehopper: Use multimethod to handle str vs ExtractedText
2020-10-08 12:15:58 +02:00
a17ee2afec
🚧 dinglehopper: Guarantee NFC + rename from_text → from_str
2020-10-08 11:25:01 +02:00
7843824eaf
🚧 dinglehopper: Support str & ExtractedText in CER and distance functions
2020-10-08 10:47:20 +02:00
5bee55c896
💩 dinglehopper: Fix OCR-D CLI test by working around ocrd_cli_wrap_processor() check for arguments
2020-10-07 18:40:06 +02:00
96b55f1806
🚧 dinglehopper: Hierarchical text representation
2020-10-07 18:31:52 +02:00
db6292611f
🧹 dinglehopper: Remove merged text extraction test code
2020-10-07 16:07:27 +02:00
d706ef4621
📝 Document CER/WER and the format detection (Fixes GH-26)
2020-09-30 17:58:05 +02:00