Commit Graph

31 Commits (fe1a713d559b11461f0101f729ce87a45e4b51d9)

Author SHA1 Message Date
Mike Gerber 4d4ead4cc8 🐛 Fix word segmentation with uniseg 0.8.0
Mike Gerber 38fcbc8e1c Merge branch 'master' into performance
Mike Gerber 68a12f8f7f ⬆ Update uniseg dependency
@maxbachmann also improved the performance of uniseg; the improvement is in 0.7.2, so
update our dependency.
Mike Gerber 7ed076d3c1 ⬆ Update multimethod dependency
We had some issues while reviewing/rebasing . We don't support Python 3.5 anymore,
so lifting the hard pin on multimethod 1.3.
Mike Gerber d8f84ec9ac 🧹 Remove old six dependency (workaround for )
Mike Gerber 1c3b28d873 ⬆ Update multimethod dependency
We had some issues while reviewing/rebasing . We don't support Python 3.5 anymore,
so lifting the hard pin on multimethod 1.3.
Mike Gerber 69325facf2 🐛 Detect encoding (incl BOM) when reading files
As @imlabormitlea-code reported in gh-79, dinglehopper did not handle text files with
a BOM well. Fix this by using chardet to detect the encoding, which also detects the
BOM, and by using the proper encoding to read the files, so that the BOM is not
included in the resulting extracted text.

Fixes gh-80.
Max Bachmann f48e305347 use uniseg again
Max Bachmann d2bbc8a6c7 update rapidfuzz version
Max Bachmann a1f0a5e2d3 replace uniseg with uniseg2
Gerber, Mike 15dfbac3a7 Revert "Revert "Merge pull request from maxbachmann/rapidfuzz""
This reverts commit 76bd50f1db.
Gerber, Mike ede9402a6c Revert "💩 Stick with rapidfuzz < 2.1.0 for now"
This reverts commit 0e153db9ca.
Gerber, Mike 0e153db9ca 💩 Stick with rapidfuzz < 2.1.0 for now
Gerber, Mike 76bd50f1db Revert "Merge pull request from maxbachmann/rapidfuzz"
This reverts commit 85f751aacc, reversing
changes made to 1febea8c92.
Max Bachmann e543438496 replace usage of deprecated rapidfuzz APIs
Gerber, Mike 76bacc0f15 🐛 Bump rapidfuzz dep to >= 2.0.5 (Fixes gh-65)
Gerber, Mike f0f3cd2d96 ⬆️ dinglehopper: Require rapidfuzz >= 1.9.1
See https://github.com/qurator-spk/dinglehopper/issues/64.
Gerber, Mike a5c9c7438f 💩 ocrd-galley: Work around
OCR-D/core currently needs six until the next release. Fix the build by
requiring it here.
Gerber, Mike af8da1d716 dinglehopper: Use rapidfuzz for editops
Gerber, Mike 8cd8314c8a 🐛 dinglehopper: Bump up ocrd req for zip_input_files
See also GH-49.
Gerber, Mike f2367ac0c3 🐛 Fix OCR-D CLI for newest OCR-D
Now that find_files() is a generator, we can't use [0] to get the file.
Gerber, Mike 5ed184c8c4 dinglehopper: Show a progressbar on --progress
Gerber, Mike f50591abac Merge branch 'feat/display-segment-id'
Gerber, Mike b14c35e147 🎨 dinglehopper: Use multimethod to handle str vs ExtractedText
Konstantin Baierer 004ae298ca ocrd cli: use make_file_id and assert_file_grp_cardinality
Gerber, Mike 2c69e077fe 🚧 dinglehopper: WIP data structure for extracted text
Gerber, Mike cdfd4d321d 🐛 dinglehopper: Add missing requirement MarkupSafe
Gerber, Mike 48a31ce672 Revert "Merge branch 'master' of https://github.com/qurator-spk/sbb_textline_detector"
This reverts commit 2c89bf3b35ee290d7b830ef270df3a96aa48245e, reversing
changes made to 9f7e413148ca5dbac9b555d7b0d0a5fa3a0f5340.
b-vr103 1303a7d92f Merge branch 'master' of https://github.com/qurator-spk/sbb_textline_detector
Gerber, Mike 02a0e093bf dinglehopper: Add OCR-D interface
Gerber, Mike 89048bf55d ➡ Move dinglehopper into its own directory
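
Two of the fixes in the log above are easy to illustrate in isolation: commit 69325facf2 (read files so that a BOM does not end up in the extracted text) and commit f2367ac0c3 (a generator cannot be indexed with [0]). The following is a minimal sketch, not the code from those commits. It uses only the Python standard library in place of chardet, and the names read_text_without_bom, first_file, the local find_files stand-in, and the file names are hypothetical.

```python
import codecs
from pathlib import Path

# Known BOMs and a codec that consumes them while decoding.
# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def read_text_without_bom(path):
    """Read a text file; a leading BOM selects the codec and is dropped.

    Assumption: files without a BOM are UTF-8. A real detector such as
    chardet would guess the encoding from the content instead.
    """
    raw = Path(path).read_bytes()
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw.decode(encoding)
    return raw.decode("utf-8")


def find_files():
    """Hypothetical stand-in for a lookup that yields results lazily."""
    yield from ["OCR-D-GT_0001.xml", "OCR-D-GT_0002.xml"]


def first_file(files):
    """Return the first item of any iterable, including a generator."""
    first = next(iter(files), None)
    if first is None:
        raise ValueError("no matching file found")
    return first
```

Calling find_files()[0] on the stand-in raises a TypeError because generators are not subscriptable, whereas first_file(find_files()) works for both lists and generators. Wrapping the result in list(...) would also work, at the cost of consuming the whole generator.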