39 Commits (master)

Author SHA1 Message Date
Mike Gerber b336f98271 🐛 Fix reading plain text files
As reported by @tallemeersch in gh-107, newlines were not removed for plain text files.
Fix this by stripping the lines as suggested.

Fixes gh-107.
1 week ago
Mike Gerber 41a0fad352 📦 v0.9.6 1 week ago
Stefan Weil 79701e410d Fix some typos (found by `codespell` and `typos`)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2 weeks ago
Mike Gerber 2383730a55 ✔ Test using empty files
Test edge cases + empty files, e.g. empty text content and a Unicode BOM character.

See also gh-79.
1 month ago
Mike Gerber edabffec7e 🧹 tests: Move comment out of the code (bad style + weird formatting) 1 month ago
Mike Gerber 32d4037533 ⚙ cli: Annotate types in process_dir() 1 month ago
Mike Gerber be7c1dd25d 🧹 Make from_text_segment()'s textequiv_level keyword-only 2 months ago
Mike Gerber 932bfafc7d 🧹 Make process_dir() keyword arguments keyword-only 2 months ago
Mike Gerber c29a80bc81 📦 v0.9.5 2 months ago
Mike Gerber 5d9f0c482f 🐛 Check that we always get a valid ALTO namespace (satisfies mypy) 2 months ago
Mike Gerber 19d1a00817 🎨 Reformat (Black) 2 months ago
Mike Gerber 4d4ead4cc8 🐛 Fix word segmentation with uniseg 0.8.0 2 months ago
Mike Gerber 483e809691 🔍 mypy: Use an almost strict mypy configuration, and fix any issues 4 months ago
Mike Gerber ad316aeabc 🔍 mypy: Use a compatible syntax for multimethod 4 months ago
Mike Gerber 8166435958 🔍 mypy: Remove ExtractedText.segments converter 4 months ago
Mike Gerber 24c25b6fcd 🔍 mypy: Avoid using check() for all attr validators 4 months ago
Mike Gerber ac9d360dcd 🔍 mypy: Make cli.process() typed so mypy checks it (and issues no warning) 4 months ago
Sadra Barikbin 4466422cda Fix a typo 4 months ago
Sadra Barikbin c90a61c12c Fix a few typos 4 months ago
Mike Gerber c752793be6 🐛 Use typing.List instead of list, for Python <3.9 4 months ago
Mike Gerber 071766efc2 🐛 Use Optional instead of | None, for Python <3.10 4 months ago
Mike Gerber c1681551af 🐛 Fix generating word differences 4 months ago
Mike Gerber 296a820990 Merge branch 'master' of https://github.com/qurator-spk/dinglehopper 4 months ago
Mike Gerber 38fcbc8e1c Merge branch 'master' into performance 4 months ago
Sadra Barikbin b0e906ad00 Update Levenshtein.ipynb
Fix a tiny typo in Levenshtein notebook.
5 months ago
Mike Gerber f077ce2e1b 🐛 dinglehopper-summarize: Handle reports without difference stats 7 months ago
Mike Gerber 8a1ea4ec93 🎨 Add newlines at end of files (ruff) 7 months ago
Mike Gerber 9d862e418b ✔ Add mets:FLocat's @LOCTYPE/OTHERLOCTYPE to test data
Newest OCR-D wasn't happy with the test data anymore (see gh-89). I'm not sure if the
test data was invalid the way it was, but having a LOCTYPE certainly is "prettier" so
adding it. This fixes the test again.
7 months ago
Mike Gerber a1a7f95ac6 📦 v0.9.4 9 months ago
Mike Gerber 6c70afbbc5 📦 v0.9.3 9 months ago
Mike Gerber 98a67c7b3b 📦 v0.9.2 9 months ago
Mike Gerber 1c95a82941 📦 v0.9.1 9 months ago
Mike Gerber 1dad18909c 🧹 Make dinglehopper.* exports explicit 10 months ago
Mike Gerber e4431797e6 🎨 Reformat comments + strings manually (not auto-fixed by Black) 10 months ago
Mike Gerber 704e7cca1c ⬆ Use f-strings 10 months ago
Mike Gerber bea56117ae 🎨 Reformat using Black 10 months ago
Mike Gerber d50d624554 🎨 Sort imports (auto-fixed by ruff) 10 months ago
Mike Gerber 69325facf2 🐛 Detect encoding (incl BOM) when reading files
As @imlabormitlea-code reported in gh-79, dinglehopper did not handle text files with a
BOM well. Fix this by using chardet to detect the encoding, which also detects the BOM,
and by reading the files with the detected encoding, so that the BOM is not included in
the resulting extracted text.

Fixes gh-80.
10 months ago
Mike Gerber 325e5af5f5 🐛 Move source into src/ to fix install
Installing was broken since moving to pyproject.toml, which we didn't notice because of
leftover files in build/. Fix this by using the convention of having the source files
in src/ and adjusting pyproject.toml accordingly.

Fixes gh-86. 🤞
10 months ago
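
Commits b336f98271 and 69325facf2 above both concern reading plain text files: the first strips line breaks from each line, the second keeps a BOM out of the extracted text. The following is a minimal sketch of those two ideas, not dinglehopper's actual code. The function name `read_plain_text_lines` is hypothetical, and instead of chardet the sketch uses the standard library's `utf-8-sig` codec, so it only handles UTF-8 input (with or without a BOM).

```python
import os
import tempfile
from typing import List


def read_plain_text_lines(path: str) -> List[str]:
    """Read a UTF-8 text file and return its lines without BOM or line breaks.

    Hypothetical helper for illustration; assumes UTF-8 input rather than
    detecting the encoding with chardet as the commit describes.
    """
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    with open(path, "r", encoding="utf-8-sig") as f:
        # Remove the trailing line break from every line.
        return [line.rstrip("\r\n") for line in f]


# Small demonstration with a temporary file that starts with a UTF-8 BOM.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "wb") as f:
        f.write(b"\xef\xbb\xbfHello\nWorld\n")
    print(read_plain_text_lines(demo_path))  # prints ['Hello', 'World']
```

An empty file yields an empty list, which is the kind of edge case commit 2383730a55 mentions testing.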