mirror of https://github.com/qurator-spk/dinglehopper.git synced 2025-07-06 17:09:59 +02:00
Commit graph

77 commits

Author SHA1 Message Date
c3aa48ec3b Merge branch 'master' of https://github.com/qurator-spk/dinglehopper 2025-04-24 17:16:06 +02:00
628594ef98 📦 v0.11.0 2025-04-24 17:14:44 +02:00
5639f3db7f ✔ Add a test that checks if plain text files with BOM are read correctly 2025-04-24 16:44:29 +02:00
14a4bc56d8 🐛 Add --plain-encoding option to dinglehopper-extract 2025-04-22 18:24:35 +02:00
a70260c10e 🐛 Use warning() to fix DeprecationWarning 2025-04-22 13:57:19 +02:00
224aa02163 🚧 Fix help text 2025-04-22 13:57:19 +02:00
9db5b4caf5 🚧 Add OCR-D parameter for plain text encoding 2025-04-22 13:57:19 +02:00
5578ce83a3 🚧 Add option for text encoding to line dir cli 2025-04-22 13:57:19 +02:00
cf59b951a3 🚧 Add option for text encoding to line dir cli 2025-04-22 13:57:19 +02:00
480b3cf864 ✔ Test that CLI produces a complete HTML report 2025-04-22 13:57:19 +02:00
f1a586cff1 ✔ Test line dirs CLI 2025-04-22 13:57:18 +02:00
3b16c14c16 ✔ Properly test line dir finding 2025-04-22 13:57:18 +02:00
322faeb26c 🎨 Sort imports 2025-04-22 13:57:18 +02:00
c37316da09 🐛 cli_line_dirs: Fix word differences section
At the time of generation of the section, the {gt,ocr}_words generators
were drained. Fix by using a list.

Fixes gh-124.
2025-04-22 13:57:18 +02:00
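
The generator issue described in c37316da09 can be illustrated with a minimal sketch. This is not the project's code: `words()` is a hypothetical stand-in for whatever produces the GT and OCR words, and the names below are assumptions made for illustration only. It shows that a generator yields its items once, so a second pass (such as a later report section) sees nothing unless the items were first collected into a list.

```python
def words(text):
    """Hypothetical stand-in: lazily yield whitespace-separated words."""
    for word in text.split():
        yield word


# A generator can only be consumed once.
gt_words = words("the quick brown fox")
first_pass = list(gt_words)   # consumes every item
second_pass = list(gt_words)  # the generator is now drained

# Materializing the generator into a list allows repeated iteration.
gt_words_list = list(words("the quick brown fox"))
first_from_list = list(gt_words_list)
second_from_list = list(gt_words_list)

print(first_pass)        # all four words
print(second_pass)       # empty list
print(second_from_list)  # all four words again
```

The trade-off is memory: a list holds all words at once, which is usually acceptable for line-sized or page-sized texts.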
9414a92f9f 🐛 cli_line_dirs: Type-annotate functions 2025-04-22 13:57:18 +02:00
68344e48f8 🎨 Reformat cli_line_dirs 2025-04-22 13:57:18 +02:00
73ee16fe51 🚧 Support 'merged' GT+OCR line directories 2025-04-22 13:57:18 +02:00
6980d7a252 🚧 Use our own removesuffix() as we still support Python 3.8 2025-04-22 13:57:18 +02:00
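
For context, `str.removesuffix()` only exists from Python 3.9 onward, which is why a project supporting Python 3.8 needs its own helper. The following is a minimal sketch of such a helper, not the project's actual implementation; the function signature and the example file names are assumptions.

```python
def removesuffix(text: str, suffix: str) -> str:
    """Return text without the trailing suffix, mirroring str.removesuffix()."""
    # The explicit check for an empty suffix matters: text[:-0] would be "".
    if suffix and text.endswith(suffix):
        return text[: -len(suffix)]
    return text


# Hypothetical file names, chosen only for illustration.
print(removesuffix("page1.gt.txt", ".gt.txt"))
print(removesuffix("page1.txt", ".gt.txt"))
```

Unlike `str.rstrip()`, which strips any trailing characters from a set, this removes the exact suffix at most once.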
2bf2529c38 🚧 Port new line dir functions 2025-04-22 13:57:17 +02:00
ad8e6de36b 🐛 cli_line_dirs: Fix character diff reports 2025-04-22 13:57:17 +02:00
4024e350f7 🚧 Test new flexible line dirs functions 2025-04-22 13:57:17 +02:00
817e0c95f7 📦 v0.10.1 2025-04-22 10:32:29 +02:00
Robert Sachunsky
64444dd419 opt out of 7f8a8dd5 (uniseg update that requires py39) 2025-04-17 16:12:37 +02:00
ef817cb343 📦 v0.10.0 2025-04-17 08:37:37 +02:00
kba
831a24fc4c typo: report_prefix -> file_id 2025-04-17 08:04:52 +02:00
Konstantin Baierer
f6a2c94520 ocrd_cli: but do check for existing output files
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
2025-04-17 08:04:52 +02:00
Konstantin Baierer
4162836612 ocrd_cli: no need to check fileGrp dir exists
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
2025-04-17 08:04:52 +02:00
Konstantin Baierer
c0aa82d188 OCR-D processor: properly handle missing or non-downloaded GT/OCR file
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
2025-04-17 08:04:51 +02:00
kba
63031b30bf Port to OCR-D/core API v3 2025-04-16 14:45:16 +02:00
7f8a8dd564 🐛 Fix for changed API of uniseg's word_break 2025-04-16 09:10:43 +02:00
f2e290dffe 🐛 Fix --version option in OCR-D CLI 2024-07-19 14:54:46 +02:00
6d1daf1dfe Support --version option in CLI 2024-07-19 14:41:54 +02:00
129e6eb427 📦 v0.9.7 2024-07-11 17:25:38 +02:00
6048107889 Merge branch 'master' of https://github.com/qurator-spk/dinglehopper 2024-07-11 16:26:29 +02:00
2ee37ed4e3 🎨 Sort imports 2024-07-11 16:25:38 +02:00
521f034fba Merge pull request #116 from stweil/master
Fix typo
2024-07-10 01:13:24 +02:00
4047f8b6e5 🐛 Fix loading ocrd-tool.json for Python 3.12 2024-07-09 21:01:31 +02:00
Stefan Weil
cd68a973cb Fix typo
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2024-05-26 09:18:00 +02:00
b336f98271 🐛 Fix reading plain text files
As reported by @tallemeersch in gh-107, newlines were not removed for plain text files.
Fix this by stripping the lines as suggested.

Fixes gh-107.
2024-05-06 18:14:16 +02:00
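
The problem behind b336f98271 can be sketched as follows. Iterating over a text stream yields lines that still carry their line endings, so they have to be stripped explicitly. `read_lines()` is a hypothetical helper, not the project's function, and whether the actual fix strips only line endings or all surrounding whitespace is not stated in the commit message.

```python
import io


def read_lines(stream):
    """Hypothetical helper: return lines without their trailing line endings."""
    return [line.rstrip("\r\n") for line in stream]


# Iterating a text stream keeps the newline on every line.
unstripped = list(io.StringIO("first line\nsecond line\n"))

# Stripping the line endings gives the bare text.
stripped = read_lines(io.StringIO("first line\nsecond line\n"))

print(unstripped)
print(stripped)
```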
41a0fad352 📦 v0.9.6 2024-05-06 17:48:48 +02:00
Stefan Weil
79701e410d Fix some typos (found by codespell and typos)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2024-04-29 08:42:17 +02:00
2383730a55 ✔ Test using empty files
Test edge cases + empty files, e.g. empty text content and a Unicode BOM character.

See also gh-79.
2024-04-08 20:33:03 +02:00
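
As a sketch of the BOM edge case mentioned in 2383730a55: decoding UTF-8 bytes with the plain `utf-8` codec keeps a leading byte order mark as the character U+FEFF, while Python's `utf-8-sig` codec drops it. How dinglehopper itself handles this is not shown here; the snippet only demonstrates the standard-library behaviour.

```python
import codecs

# UTF-8 encoded text preceded by a byte order mark.
data = codecs.BOM_UTF8 + "Hello".encode("utf-8")

with_bom = data.decode("utf-8")         # BOM survives as U+FEFF
without_bom = data.decode("utf-8-sig")  # BOM is removed while decoding

# A file that contains nothing but a BOM decodes to an empty string.
only_bom = codecs.BOM_UTF8.decode("utf-8-sig")

print(repr(with_bom))
print(repr(without_bom))
print(repr(only_bom))
```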
edabffec7e 🧹 tests: Move comment out of the code (bad style + weird formatting) 2024-04-04 19:46:08 +02:00
32d4037533 ⚙ cli: Annotate types in process_dir() 2024-04-04 19:38:27 +02:00
be7c1dd25d 🧹 Make from_text_segment()'s textequiv_level keyword-only 2024-03-27 21:09:34 +01:00
932bfafc7d 🧹 Make process_dir() keyword arguments keyword-only 2024-03-27 19:44:09 +01:00
c29a80bc81 📦 v0.9.5 2024-03-27 18:49:13 +01:00
5d9f0c482f 🐛 Check that we always get a valid ALTO namespace (satisfies mypy) 2024-03-27 17:57:53 +01:00
19d1a00817 🎨 Reformat (Black) 2024-03-27 17:36:05 +01:00
4d4ead4cc8 🐛 Fix word segmentation with uniseg 0.8.0 2024-03-26 19:34:22 +01:00