Benjamin Rosemann
40f23b8482
Added comments
2021-06-14 12:29:34 +02:00
Benjamin Rosemann
cee7b6891b
Fix CI Build
2021-06-12 10:40:23 +02:00
Benjamin Rosemann
714b569195
Fixed some flake8 and mypy issues.
2021-06-11 16:09:19 +02:00
Benjamin Rosemann
a44a3d4bf2
Error handling
2021-06-11 15:33:13 +02:00
Benjamin Rosemann
06468a436e
Implemented new metrics behaviour
2021-06-11 15:08:45 +02:00
Benjamin Rosemann
9f5112f8f6
Remove support for ExtractedText for bag metrics.
2021-06-11 10:23:26 +02:00
Benjamin Rosemann
381fe7cb6b
Switch to result tuple instead of multiple return parameters
2021-06-11 10:21:23 +02:00
Benjamin Rosemann
974ca3e5c0
Split html and json report generation
2021-06-11 09:35:26 +02:00
Benjamin Rosemann
8cd624f795
Add BoC and BoW metric
...
Also some refactoring for helper methods on normalization and word
splitting.
2021-06-08 17:41:44 +02:00
Benjamin Rosemann
4ccae9432d
Move metrics into separate package
2021-05-27 16:37:34 +02:00
249787686f
Merge branch 'master' of github.com:qurator-spk/dinglehopper
2021-05-20 09:42:15 +02:00
2a6cc5823e
🐛 dinglehopper: Call initLogging before logging
...
When using ocrd_utils' getLogger(), we need to call initLogging() before doing any
logging.
Fixes #55.
2021-05-20 09:39:09 +02:00
Konstantin Baierer
7fde00d911
ReadingOrder may also contain UnorderedGroupIndexed
2021-05-18 17:34:08 +02:00
1778b36a9a
🚧 dinglehopper: Read PAGE UnorderedGroup in XML order
2021-04-15 21:09:45 +02:00
Benjamin Rosemann
a68fc269d9
Fix the extraction of text from Page with TableRegion
...
Dinglehopper did not consider `OrderedGroupIndexed` in the `ReadingOrder`
element when extracting text regions. As a consequence, a `TableRegion`
was not considered for text extraction.
2020-11-27 11:18:11 +01:00
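To illustrate the problem described in a68fc269d9, here is a minimal sketch of walking a `ReadingOrder` so that nested `OrderedGroupIndexed` elements are followed rather than skipped. The element and attribute names come from the PAGE schema; the sample document, its region ids, the missing XML namespace and the `region_refs` helper are assumptions made for illustration, not dinglehopper's actual code.

```python
import xml.etree.ElementTree as ET
from typing import List

# Made-up, namespace-free sample; real PAGE files declare a namespace.
SAMPLE = """
<ReadingOrder>
  <OrderedGroup id="ro">
    <RegionRefIndexed index="0" regionRef="r_heading"/>
    <OrderedGroupIndexed index="1" regionRef="r_table">
      <RegionRefIndexed index="1" regionRef="r_cell_b"/>
      <RegionRefIndexed index="0" regionRef="r_cell_a"/>
    </OrderedGroupIndexed>
    <RegionRefIndexed index="2" regionRef="r_footer"/>
  </OrderedGroup>
</ReadingOrder>
"""


def region_refs(group: ET.Element) -> List[str]:
    """Collect regionRef values in index order, descending into nested groups."""
    refs: List[str] = []
    children = sorted(group, key=lambda child: int(child.get("index", 0)))
    for child in children:
        if child.tag == "RegionRefIndexed":
            refs.append(child.get("regionRef"))
        elif child.tag == "OrderedGroupIndexed":
            # A nested group (here standing for a table) contributes
            # its own children instead of being ignored.
            refs.extend(region_refs(child))
    return refs


root = ET.fromstring(SAMPLE.strip())
order = region_refs(root.find("OrderedGroup"))
```

A traversal that only looked at `RegionRefIndexed` children would return the heading and footer but none of the table cells.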
Konstantin Baierer
74e0ac18ed
ocrd cli: use core-provided zip_input_files method
2020-11-19 16:00:28 +01:00
389e253c11
🐛 dinglehopper: Fix alto_extract_lines()'s type annotation
2020-11-12 19:32:38 +01:00
fe3923a8af
🐛 dinglehopper: Fix alto_extract()'s type annotation
2020-11-12 19:19:05 +01:00
132f91d500
✔️ dinglehopper: Add missing integration test markers
2020-11-12 19:10:23 +01:00
Benjamin Rosemann
ce752e1912
Remove .idea folder and modify .gitignore
...
Sharing even parts of the .idea folder in a worldwide setting is bound to
generate more problems than solutions. Therefore it should be removed
and consequently ignored in .gitignore.
Also adds some Python-specific entries to the .gitignore file.
2020-11-11 11:36:17 +01:00
Benjamin Rosemann
5270737c1f
Skip test on Windows because it is Unix-specific.
2020-11-11 11:36:17 +01:00
32a4b95a99
🐛 dinglehopper: Normalize in plain_extract()
2020-11-10 18:51:14 +01:00
14421c8e53
🎨 dinglehopper: Reformat using black
2020-11-10 12:29:55 +01:00
31c63f9e4c
🎨 dinglehopper: s/LOG/log
2020-11-09 16:55:43 +01:00
Robert Sachunsky
a60c14351e
1 more update for core's getLogger context
2020-11-03 17:46:59 +01:00
Benjamin Rosemann
c02569b41e
Fix f-strings for Python 3.5
2020-10-29 12:33:54 +01:00
Benjamin Rosemann
7b27b2834e
More complex sorting for text extraction
...
When extracting text from TextEquiv nodes we may encounter nodes without
an index or nodes that should get sorted via the conf attribute.
Therefore we added a more complex algorithm to extract a TextEquiv and
inform the user via log messages if we encounter structures that we can
handle but that may produce unexpected results.
2020-10-29 10:03:40 +01:00
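The commit message for 7b27b2834e does not spell out the selection rules, so the following is only a sketch of one plausible policy: prefer the lowest `index`, fall back to the highest `conf` when no index is present, and log a warning whenever the input is ambiguous. The `TextEquiv` dataclass and `pick_text_equiv` function are hypothetical stand-ins, not dinglehopper's implementation.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional

log = logging.getLogger("textequiv_sketch")


@dataclass
class TextEquiv:
    """Hypothetical stand-in for a PAGE TextEquiv node."""

    text: str
    index: Optional[int] = None
    conf: Optional[float] = None


def pick_text_equiv(candidates: List[TextEquiv]) -> Optional[TextEquiv]:
    """Pick one TextEquiv; the policy here is an assumption, see lead-in."""
    if not candidates:
        return None
    indexed = [c for c in candidates if c.index is not None]
    if indexed:
        if len(indexed) < len(candidates):
            log.warning("Some TextEquivs have no index; ignoring them.")
        return min(indexed, key=lambda c: c.index)
    if len(candidates) > 1:
        log.warning("No TextEquiv has an index; falling back to conf.")
    # Treat a missing conf as the lowest possible confidence.
    return max(
        candidates,
        key=lambda c: c.conf if c.conf is not None else float("-inf"),
    )
```

For example, `pick_text_equiv([TextEquiv("a", index=1), TextEquiv("b", index=0), TextEquiv("c")])` selects the node with text `"b"` and logs a warning about the unindexed node.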
Benjamin Rosemann
6ff831dfd2
Sort textlines with missing indices
...
Python's `sorted` function will fail with a TypeError when called on a
list that mixes `None` and integers:
```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we are using `float('inf')` instead of `None` in case of
missing textline indices.
2020-10-29 10:03:40 +01:00
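The workaround from 6ff831dfd2 can be sketched as a sort key that maps a missing index to `float('inf')`, so unindexed lines sort after all indexed ones. The `sort_key` helper and the sample data are assumptions for illustration only.

```python
from typing import List, Optional, Tuple


def sort_key(index: Optional[int]) -> float:
    """Map a missing index to infinity so it compares with integers."""
    return float("inf") if index is None else index


# (index, text) pairs; the second line has no index.
lines: List[Tuple[Optional[int], str]] = [
    (2, "second"),
    (None, "unindexed"),
    (0, "first"),
]

ordered = sorted(lines, key=lambda item: sort_key(item[0]))
texts = [text for _, text in ordered]
```

Because `sorted` is stable, several lines without an index keep their original relative order at the end of the result.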
5cbe148741
🐛 dinglehopper: Skip pages if there is neither GT nor OCR (Fixes GH-34)
2020-10-21 19:29:45 +02:00
e4e2777cb7
🐛 dinglehopper: Do try to get text when no TextEquivs exist
2020-10-21 17:59:44 +02:00
1c88891a98
✔️ Add test data for LAREX's indexed TextEquivs (unused)
2020-10-21 17:51:15 +02:00
19d15e3ecc
🐛 dinglehopper: Honor TextEquiv index (Closes GH-33)
2020-10-21 17:50:21 +02:00
f626a2ebe6
🧹 dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder
2020-10-21 17:03:55 +02:00
8b4ee20a40
✨ Add a new CLI tool dinglehopper-extract to just give the extracted text
2020-10-21 16:30:48 +02:00
b23b75b601
✨ dinglehopper: Give segment ids from the extracted textequiv_level
2020-10-21 16:04:33 +02:00
b23e4ce30e
✨ dinglehopper: Add OCR-D parameter to choose TextEquiv level
2020-10-21 14:38:19 +02:00
9744fa2567
✨ dinglehopper: Add CLI option to choose TextEquiv level
2020-10-20 19:33:39 +02:00
75733039b8
🧹 dinglehopper: Do not hardcode joiner to \n
2020-10-20 18:43:56 +02:00
3848412349
✨ dinglehopper: Implement the basic text extraction from PAGE TextLines
2020-10-20 18:40:21 +02:00
f2367ac0c3
🐛 Fix OCR-D CLI for newest OCR-D
...
Now that find_files() is a generator, we can't use [0] to get the file.
2020-10-16 14:58:27 +02:00
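The underlying Python behaviour behind f2367ac0c3 can be shown in isolation. In this sketch `find_files` is a hypothetical generator standing in for the OCR-D method, and the file names are made up; how the commit itself fetches the first file is not stated in the message, so `next()` is just one common option.

```python
from typing import Iterator, Optional


def find_files() -> Iterator[str]:
    """Hypothetical stand-in that yields file names lazily."""
    yield "OCR-D-GT_0001.xml"
    yield "OCR-D-GT_0002.xml"


# Indexing a generator raises a TypeError because generators
# do not support subscription.
error: Optional[str] = None
try:
    find_files()[0]  # type: ignore[index]
except TypeError as exc:
    error = str(exc)

# next() takes the first item; the default avoids StopIteration
# when the generator is empty.
first = next(find_files(), None)
```

Wrapping the call in `list(...)` would also restore indexing, at the cost of consuming the whole generator up front.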
5ed184c8c4
✨ dinglehopper: Show a progressbar on --progress
2020-10-15 16:09:54 +02:00
4951823a29
🧹 dinglehopper: Disable metrics in JSON report, too
2020-10-15 15:38:15 +02:00
82217a25bb
🧹 dinglehopper: Move all normalization code to extracted_text.py
2020-10-08 17:29:25 +02:00
c6c6b8efab
📝 dinglehopper: Add detail about the text extraction and ExtractedText
2020-10-08 17:05:36 +02:00
f50591abac
Merge branch 'feat/display-segment-id'
2020-10-08 13:39:38 +02:00
c514abfb9f
🧹 dinglehopper: Sanitize imports
2020-10-08 13:33:19 +02:00
1077dc64ce
➡️ dinglehopper: Move ExtractedText to its own file
2020-10-08 13:25:20 +02:00
9dd4ff0aae
✨ dinglehopper: Extract line IDs for ALTO
2020-10-08 12:54:28 +02:00
f3aafb6fdf
✨ dinglehopper: Validate ExtractedText.{segments,_text} in both directions
2020-10-08 12:20:27 +02:00
b14c35e147
🎨 dinglehopper: Use multimethod to handle str vs ExtractedText
2020-10-08 12:15:58 +02:00