Mike Gerber
b336f98271
🐛 Fix reading plain text files
As reported by @tallemeersch in gh-107, newlines were not removed for plain text files.
Fix this by stripping the lines as suggested.
Fixes gh-107.
1 week ago
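A minimal sketch of the kind of fix described in the commit above, assuming the plain text file is read line by line. The helper name `read_stripped_lines` is hypothetical and is not taken from dinglehopper's code; the actual change may strip differently.

```python
def read_stripped_lines(path, encoding="utf-8"):
    """Hypothetical helper: return the lines of a text file without line endings."""
    with open(path, "r", encoding=encoding) as f:
        # Iterating over a text file yields each line with its trailing "\n",
        # so the line ending is stripped explicitly here.
        return [line.rstrip("\r\n") for line in f]
```

Without the `rstrip` call, every element except possibly the last would still end in a newline, which would then count as a character when comparing texts.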
Mike Gerber
41a0fad352
📦 v0.9.6
1 week ago
Stefan Weil
79701e410d
Fix some typos (found by `codespell` and `typos`)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2 weeks ago
Mike Gerber
2383730a55
✔ Test using empty files
Test edge cases + empty files, e.g. empty text content and a Unicode BOM character.
See also gh-79.
1 month ago
Mike Gerber
edabffec7e
🧹 tests: Move comment out of the code (bad style + weird formatting)
1 month ago
Mike Gerber
32d4037533
⚙ cli: Annotate types in process_dir()
1 month ago
Mike Gerber
be7c1dd25d
🧹 Make from_text_segment()'s textequiv_level keyword-only
2 months ago
Mike Gerber
932bfafc7d
🧹 Make process_dir() keyword arguments keyword-only
2 months ago
Mike Gerber
c29a80bc81
📦 v0.9.5
2 months ago
Mike Gerber
5d9f0c482f
🐛 Check that we always get a valid ALTO namespace (satisfies mypy)
2 months ago
Mike Gerber
19d1a00817
🎨 Reformat (Black)
2 months ago
Mike Gerber
4d4ead4cc8
🐛 Fix word segmentation with uniseg 0.8.0
2 months ago
Mike Gerber
483e809691
🔍 mypy: Use an almost strict mypy configuration, and fix any issues
4 months ago
Mike Gerber
ad316aeabc
🔍 mypy: Use a compatible syntax for multimethod
4 months ago
Mike Gerber
8166435958
🔍 mypy: Remove ExtractedText.segments converter
4 months ago
Mike Gerber
24c25b6fcd
🔍 mypy: Avoid using check() for all attr validators
4 months ago
Mike Gerber
ac9d360dcd
🔍 mypy: Make cli.process() typed so mypy checks it (and issues no warning)
4 months ago
Sadra Barikbin
4466422cda
Fix a typo
4 months ago
Sadra Barikbin
c90a61c12c
Fix a few typos
4 months ago
Mike Gerber
c752793be6
🐛 Use typing.List instead of list, for Python <3.9
4 months ago
Mike Gerber
071766efc2
🐛 Use Optional instead of | None, for Python <3.10
4 months ago
Mike Gerber
c1681551af
🐛 Fix generating word differences
4 months ago
Mike Gerber
296a820990
Merge branch 'master' of https://github.com/qurator-spk/dinglehopper
4 months ago
Mike Gerber
38fcbc8e1c
Merge branch 'master' into performance
4 months ago
Sadra Barikbin
b0e906ad00
Update Levenshtein.ipynb
Fix a tiny typo in Levenshtein notebook.
5 months ago
Mike Gerber
f077ce2e1b
🐛 dinglehopper-summarize: Handle reports without difference stats
7 months ago
Mike Gerber
8a1ea4ec93
🎨 Add newlines at end of files (ruff)
7 months ago
Mike Gerber
9d862e418b
✔ Add mets:FLocat's @LOCTYPE/OTHERLOCTYPE to test data
The newest OCR-D wasn't happy with the test data anymore (see gh-89). I'm not sure if the
test data was invalid the way it was, but having a LOCTYPE certainly is "prettier", so
I'm adding it. This fixes the test again.
7 months ago
Mike Gerber
a1a7f95ac6
📦 v0.9.4
9 months ago
Mike Gerber
6c70afbbc5
📦 v0.9.3
9 months ago
Mike Gerber
98a67c7b3b
📦 v0.9.2
9 months ago
Mike Gerber
1c95a82941
📦 v0.9.1
9 months ago
Mike Gerber
1dad18909c
🧹 Make dinglehopper.* exports explicit
10 months ago
Mike Gerber
e4431797e6
🎨 Reformat comments + strings manually (not auto-fixed by Black)
10 months ago
Mike Gerber
704e7cca1c
⬆ Use f-strings
10 months ago
Mike Gerber
bea56117ae
🎨 Reformat using Black
10 months ago
Mike Gerber
d50d624554
🎨 Sort imports (auto-fixed by ruff)
10 months ago
Mike Gerber
69325facf2
🐛 Detect encoding (incl. BOM) when reading files
As @imlabormitlea-code reported in gh-79, dinglehopper did not handle text files with
a BOM well. Fix this by using chardet to detect the encoding (which also detects the
BOM) and then reading the files with that encoding, so that the BOM is not included in
the resulting extracted text.
Fixes gh-80.
10 months ago
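To illustrate the BOM problem this commit addresses, here is a rough standard-library sketch. It is not the chardet-based approach the commit describes: it only checks for known BOM prefixes and picks a codec that consumes the BOM while decoding. The function name `decode_without_bom` and the UTF-8 fallback are assumptions made for this example.

```python
import codecs

# BOM prefix -> codec that consumes the BOM while decoding.
# The UTF-32 BOMs are checked before UTF-16 because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_without_bom(data: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: decode bytes so that no BOM ends up in the text."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data.decode(encoding)
    # No BOM found: fall back to an assumed default encoding.
    return data.decode(default)
```

A detector such as chardet goes further by also guessing the encoding of files that carry no BOM at all, which this sketch does not attempt.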
Mike Gerber
325e5af5f5
🐛 Move source into src/ to fix install
Installing was broken since moving to pyproject.toml, which we didn't notice because of
leftover files in build/. Fix this by using the convention of having the source files
in src/ and adjusting pyproject.toml accordingly.
Fixes gh-86. 🤞
10 months ago