mirror of https://github.com/qurator-spk/dinglehopper.git synced 2025-06-09 03:40:12 +02:00
Commit graph

117 commits

Author SHA1 Message Date
Benjamin Rosemann
0dd5fc0ee5 Small corrections 2021-02-16 11:28:24 +01:00
Benjamin Rosemann
b24d8d5664 Performance increases
Temporarily switch to the C implementation of python-levenshtein for
editops calculation. Also added some variables, caching and type
changes for performance gains.
2021-02-16 11:28:24 +01:00
Benjamin Rosemann
0ef7810dd0 Reduce number of splits for short (one char) elements 2021-02-16 11:28:24 +01:00
Benjamin Rosemann
c9219cbacd Make sure that 0 cer and wer are reported 2021-02-16 11:28:23 +01:00
Benjamin Rosemann
fd6f57a263 Fix broken build on Python 3.5 2021-02-16 11:28:23 +01:00
Benjamin Rosemann
cac437afbf Evaluate some performance issues 2021-02-16 11:28:23 +01:00
Benjamin Rosemann
1bc7ef6c8b Correct report for fca
As the fca implementation already knows the editing operations for each
segment we use a different sequence alignment method.
2021-02-16 11:28:23 +01:00
Benjamin Rosemann
750ad00d1b Add tooltips to fca report 2021-02-16 11:28:23 +01:00
Benjamin Rosemann
53064bf833 Include fca as parameter and add some tests 2021-02-16 11:28:23 +01:00
Benjamin Rosemann
4a87adc2c7 Implement version specific data structures
As ocr-d continues the support for Python 3.5 until the end of this year,
version specific data structures have been implemented.

When the support for Python 3.5 is dropped, the extra file can easily be
removed.
2021-02-16 11:28:23 +01:00
Benjamin Rosemann
2a215a1062 Reformat using black 2021-02-16 11:28:23 +01:00
Benjamin Rosemann
5277593bdb Fix some special cases 2021-02-16 11:28:23 +01:00
Benjamin Rosemann
d7a74fa58b First draft of flexible character accuracy 2021-02-16 11:28:23 +01:00
Benjamin Rosemann
a68fc269d9 Fix the extraction of text from Page with TableRegion
Dinglehopper did not consider `OrderedGroupIndex` in the `ReadingOrder`
element when extracting text regions. As a consequence a `TableRegion`
was not considered for text extraction.
2020-11-27 11:18:11 +01:00
Konstantin Baierer
74e0ac18ed ocrd cli: use core-provided zip_input_files method 2020-11-19 16:00:28 +01:00
389e253c11 🐛 dinglehopper: Fix alto_extract_lines()'s type annotation 2020-11-12 19:32:38 +01:00
fe3923a8af 🐛 dinglehopper: Fix alto_extract()'s type annotation 2020-11-12 19:19:05 +01:00
132f91d500 ✔️ dinglehopper: Add missing integration test markers 2020-11-12 19:10:23 +01:00
Benjamin Rosemann
ce752e1912 Remove .idea folder and modify .gitignore
Sharing even parts of the .idea folder in a worldwide setting is bound to
generate more problems than solutions. Therefore it should be removed
and consequently ignored in .gitignore.

Also adds some Python specific stuff to the .gitignore file.
2020-11-11 11:36:17 +01:00
Benjamin Rosemann
5270737c1f Skip test on Windows because it is Unix specific. 2020-11-11 11:36:17 +01:00
32a4b95a99 🐛 dinglehopper: Normalize in plain_extract() 2020-11-10 18:51:14 +01:00
14421c8e53 🎨 dinglehopper: Reformat using black 2020-11-10 12:29:55 +01:00
31c63f9e4c 🎨 dinglehopper: s/LOG/log 2020-11-09 16:55:43 +01:00
Robert Sachunsky
a60c14351e 1 more update for core's getLogger context 2020-11-03 17:46:59 +01:00
Benjamin Rosemann
c02569b41e Fix f-strings for Python 3.5 2020-10-29 12:33:54 +01:00
Benjamin Rosemann
7b27b2834e More complex sorting for text extraction
When extracting text from TextEquiv nodes we may encounter nodes without
index or nodes that should get sorted via the conf attribute.

Therefore we added a more complex algorithm to extract a TextEquiv and
inform the user via log messages if we encounter structures that we can
handle but may produce unexpected results.
2020-10-29 10:03:40 +01:00
Benjamin Rosemann
6ff831dfd2 Sort textlines with missing indices
Python's `sorted` function will fail with a TypeError when called with
a list that mixes `None` and integers:

```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we are using `float('inf')` instead of `None` in case of
missing textline indices.
2020-10-29 10:03:40 +01:00
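The approach described in this commit message can be sketched roughly as follows. This is only an illustration of the `float('inf')` idea, not the code from the commit; the helper name `index_key` and the sample indices are assumptions made up for the example.

```python
def index_key(index):
    """Sort key that maps a missing (None) index to infinity.

    Hypothetical helper for illustration; not taken from dinglehopper.
    """
    return float("inf") if index is None else index


# Made-up textline indices; None marks a line without an index.
indices = [2, None, 0, 1]

# Lines without an index end up after all indexed lines.
ordered = sorted(indices, key=index_key)
print(ordered)  # [0, 1, 2, None]
```

Because `float('inf')` compares greater than any integer, entries without an index sort last instead of raising a TypeError.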
5cbe148741 🐛 dinglehopper: Skip pages if there is no GT nor OCR (Fixes GH-34) 2020-10-21 19:29:45 +02:00
e4e2777cb7 🐛 dinglehopper: Do try to get text when no TextEquivs exist 2020-10-21 17:59:44 +02:00
1c88891a98 ✔️ Add test data for LAREX's indexed TextEquivs (unused) 2020-10-21 17:51:15 +02:00
19d15e3ecc 🐛 dinglehopper: Honor TextEquiv index (Closes GH-33) 2020-10-21 17:50:21 +02:00
f626a2ebe6 🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder 2020-10-21 17:03:55 +02:00
8b4ee20a40 Add a new CLI tool dinglehopper-extract to just give the extracted text 2020-10-21 16:30:48 +02:00
b23b75b601 dinglehopper: Give segment ids from the extracted textequiv_level 2020-10-21 16:04:33 +02:00
b23e4ce30e dinglehopper: Add OCR-D parameter to choose TextEquiv level 2020-10-21 14:38:19 +02:00
9744fa2567 dinglehopper: Add CLI option to choose TextEquiv level 2020-10-20 19:33:39 +02:00
75733039b8 🧹 dinglehopper: Do not hardcode joiner to \n 2020-10-20 18:43:56 +02:00
3848412349 dinglehopper: Implement the basic text extraction from PAGE TextLines 2020-10-20 18:40:21 +02:00
f2367ac0c3 🐛 Fix OCR-D CLI for newest OCR-D
Now that find_files() is a generator, we can't use [0] to get the file.
2020-10-16 14:58:27 +02:00
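The problem named in this commit message is a general property of Python generators: they cannot be indexed. A minimal sketch, assuming a made-up stand-in function (this `find_files` is hypothetical and is not the OCR-D API, and the commit's actual fix may differ):

```python
def find_files():
    """Hypothetical stand-in that yields file names lazily."""
    yield "page_0001.xml"
    yield "page_0002.xml"


# Indexing a generator fails:
try:
    find_files()[0]
    indexing_failed = False
except TypeError:
    # 'generator' object is not subscriptable
    indexing_failed = True

# One common alternative: take the first item with next().
# The second argument is returned if the generator yields nothing.
first = next(find_files(), None)
print(first)  # page_0001.xml
```

Wrapping the call in `list(...)` would also restore indexing, at the cost of consuming the whole generator up front.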
5ed184c8c4 dinglehopper: Show a progressbar on --progress 2020-10-15 16:09:54 +02:00
4951823a29 🧹 dinglehopper: Disable metrics in JSON report, too 2020-10-15 15:38:15 +02:00
82217a25bb 🧹 dinglehopper: Move all normalization code to extracted_text.py 2020-10-08 17:29:25 +02:00
c6c6b8efab 📝 dinglehopper: Add detail about the text extraction and ExtractedText 2020-10-08 17:05:36 +02:00
f50591abac Merge branch 'feat/display-segment-id' 2020-10-08 13:39:38 +02:00
c514abfb9f 🧹 dinglehopper: Sanitize imports 2020-10-08 13:33:19 +02:00
1077dc64ce ➡️ dinglehopper: Move ExtractedText to its own file 2020-10-08 13:25:20 +02:00
9dd4ff0aae dinglehopper: Extract line IDs for ALTO 2020-10-08 12:54:28 +02:00
f3aafb6fdf dinglehopper: Validate ExtractedText.{segments,_text} in both directions 2020-10-08 12:20:27 +02:00
b14c35e147 🎨 dinglehopper: Use multimethod to handle str vs ExtractedText 2020-10-08 12:15:58 +02:00
a17ee2afec 🚧 dinglehopper: Guarantee NFC + rename from_text → from_str 2020-10-08 11:25:01 +02:00
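For context on what "NFC" means in the last commit title above: it is the Unicode normalization form that composes base characters and combining marks into single code points where possible. A small sketch using only the standard library; how dinglehopper applies it internally is not shown on this page, so the sample string is purely illustrative.

```python
import unicodedata

# 'a' followed by a combining diaeresis (two code points).
decomposed = "a\u0308"

# NFC composes the pair into the single code point U+00E4.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
```

Comparing texts only after normalizing both sides to the same form avoids counting such visually identical strings as different.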