Gerber, Mike
We were using the file group name BEST for what Kitodo seems to call MAX by convention, so we use MAX now. Currently, we assume that if MAX exists in the METS retrieved via OAI-PMH, it is not what we want, and we replace it with our own full-size IIIF URLs. Fixes GH-43. |
4 years ago | |
---|---|---|
data@3fbdbcf368 | 4 years ago | |
wrapper | 4 years ago | |
.drone.star | 4 years ago | |
.gitignore | 4 years ago | |
.gitmodules | 5 years ago | |
Dockerfile-core | 4 years ago | |
Dockerfile-core-cuda10.0 | 4 years ago | |
Dockerfile-core-cuda10.1 | 4 years ago | |
Dockerfile-dinglehopper | 4 years ago | |
Dockerfile-ocrd_calamari | 4 years ago | |
Dockerfile-ocrd_calamari03 | 4 years ago | |
Dockerfile-ocrd_cis | 4 years ago | |
Dockerfile-ocrd_fileformat | 4 years ago | |
Dockerfile-ocrd_olena | 4 years ago | |
Dockerfile-ocrd_segment | 4 years ago | |
Dockerfile-ocrd_tesserocr | 4 years ago | |
Dockerfile-ocrd_wrap | 4 years ago | |
Dockerfile-sbb_binarization | 4 years ago | |
Dockerfile-sbb_textline_detector | 4 years ago | |
LICENSE | 4 years ago | |
README-DEV.md | 4 years ago | |
README.md | 4 years ago | |
build | 4 years ago | |
build-tmp-XXX | 4 years ago | |
my_ocrd_workflow | 4 years ago | |
my_ocrd_workflow-sbb | 4 years ago | |
ocrd-workspace-from-images | 4 years ago | |
ppn2ocr | 4 years ago | |
qurator_data_lib.sh | 4 years ago | |
requirements-ppn2ocr.txt | 5 years ago | |
zdb2ocr | 5 years ago |
README.md
ocrd-galley
A Dockerized test environment for OCR-D processors 🚢
WIP. Given an OCR-D workspace with document images in the OCR-D-IMG file group, the example workflow produces:
- Binarized images
- Line segmentation
- OCR text (using Calamari and Tesseract, both with GT4HistOCR models)
- (Given ground truth in OCR-D-GT-PAGE, also an OCR text evaluation report)
If you're interested in the exact processors, versions and parameters, please take a look at the my_ocrd_workflow script and possibly the individual Dockerfiles.
Goal
Provide a test environment to produce OCR output for historical prints, using OCR-D, especially ocrd_calamari and sbb_textline_detection, including all dependencies in Docker.
How to use
Currently, due to problems with Travis CI, we no longer provide pre-built containers.
To build the containers yourself using Docker:
cd ~/devel/ocrd-galley/
./build
You can then install the wrappers into a Python venv:
cd ~/devel/ocrd-galley/wrapper
pip install .
You may then use the script my_ocrd_workflow to run your self-built containers on an example workspace:
# Download an example workspace
cd /tmp
wget https://qurator-data.de/examples/actevedef_718448162.first-page.zip
unzip actevedef_718448162.first-page.zip
# Run the workflow on it
cd actevedef_718448162.first-page
~/devel/ocrd-galley/my_ocrd_workflow
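Before running the workflow on a workspace of your own, it can help to check that the expected input is in place. The following is a minimal sketch, not part of ocrd-galley: check_workspace is a hypothetical helper, and it assumes that the METS file is called mets.xml and that each file group is stored in a directory of the same name, as in the example workspace.

```shell
# Hypothetical pre-flight check, not part of ocrd-galley.
# Assumptions: the METS file is named mets.xml, and file groups are
# stored in directories named after them (as in the example workspace).
check_workspace() {
  ws="${1:-.}"
  if [ ! -f "$ws/mets.xml" ]; then
    echo "No mets.xml found in $ws" >&2
    return 1
  fi
  if [ ! -d "$ws/OCR-D-IMG" ] || [ -z "$(ls -A "$ws/OCR-D-IMG")" ]; then
    echo "No images found in $ws/OCR-D-IMG" >&2
    return 1
  fi
  # Ground truth is optional; it only enables the evaluation report.
  if [ -d "$ws/OCR-D-GT-PAGE" ]; then
    echo "Ground truth found in $ws/OCR-D-GT-PAGE"
  fi
  return 0
}

# Usage: check_workspace . && ~/devel/ocrd-galley/my_ocrd_workflow
```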
Viewing results
You may then examine the results using PRImA's PAGE Viewer:
java -jar /path/to/JPageViewer.jar \
--resolve-dir . \
OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_00000024.xml
The example workflow also produces OCR evaluation reports using dinglehopper, if ground truth is available:
firefox OCR-D-OCR-CALAMARI-EVAL/OCR-D-OCR-CALAMARI-EVAL_00000024.html
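For workspaces with more than one page or OCR engine, listing all reports at once may be handy. This is only a sketch: list_eval_reports is a hypothetical helper, and it assumes that every evaluation file group ends in -EVAL and contains HTML reports, as OCR-D-OCR-CALAMARI-EVAL does above.

```shell
# Hypothetical helper, not part of ocrd-galley.
# Assumption: evaluation file groups are directories ending in -EVAL
# that contain one HTML report per page.
list_eval_reports() {
  ws="${1:-.}"
  for report in "$ws"/*-EVAL/*.html; do
    # Skip the unexpanded pattern if nothing matched
    if [ -e "$report" ]; then
      printf '%s\n' "$report"
    fi
  done
}

# Usage: list_eval_reports . | xargs firefox
```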
ppn2ocr
The ppn2ocr script produces a workspace and METS file with the best images for a given document in the Berlin State Library (SBB)'s digitized collection.
Install its requirements with an up-to-date pip (otherwise this will fail due to an opencv-python-headless build failure):
pip install -r ~/devel/ocrd-galley/requirements-ppn2ocr.txt
The document must be specified by its PPN, for example:
~/devel/ocrd-galley/ppn2ocr PPN77164308X
cd PPN77164308X
~/devel/ocrd-galley/my_ocrd_workflow -I MAX --skip-validation
This produces a workspace directory PPN77164308X with the OCR results in it; the results are viewable as explained above.
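To process several documents in a row, the two steps can be wrapped in a loop. The sketch below is an assumption-laden illustration: process_ppns and the GALLEY variable are hypothetical names, and the checkout location is taken from the examples in this README.

```shell
# Hypothetical batch helper, not part of ocrd-galley.
# GALLEY is an assumed variable pointing at the ocrd-galley checkout.
GALLEY="${GALLEY:-$HOME/devel/ocrd-galley}"

process_ppns() {
  failed=""
  for ppn in "$@"; do
    # Run in a subshell so that the cd does not affect the next PPN
    if ( "$GALLEY/ppn2ocr" "$ppn" &&
         cd "$ppn" &&
         "$GALLEY/my_ocrd_workflow" -I MAX --skip-validation ); then
      echo "OK: $ppn"
    else
      echo "FAILED: $ppn" >&2
      failed="$failed $ppn"
    fi
  done
  # Succeed only if every PPN was processed
  [ -z "$failed" ]
}

# Usage: process_ppns PPN77164308X
```

A failing document is reported on stderr and does not stop the remaining ones; the function returns a non-zero status if any document failed.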
ppn2ocr requires properly set up environment variables for the proxy configuration. At SBB, please read howto/docker-proxy.md and howto/proxy-settings-for-shell+python.md (in qurator's mono-repo).
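Outside SBB, a typical setup uses the conventional proxy variables that tools such as wget and pip honor. This is a sketch only: proxy.example.org:3128 is a placeholder, and that ppn2ocr reads exactly these variables is an assumption, not something this README states.

```shell
# Placeholder values: replace proxy.example.org:3128 with your own proxy.
# Which variables ppn2ocr actually reads is an assumption.
proxy_url="http://proxy.example.org:3128"
export http_proxy="$proxy_url"
export https_proxy="$proxy_url"
export HTTP_PROXY="$proxy_url"
export HTTPS_PROXY="$proxy_url"
# Hosts that should be reached directly, without the proxy
export no_proxy="localhost,127.0.0.1"
```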
ocrd-workspace-from-images
The ocrd-workspace-from-images script produces an OCR-D workspace (incl. METS) for the given images.
~/devel/ocrd-galley/ocrd-workspace-from-images 0005.png
cd workspace-xxxxx # output by the last command
~/devel/ocrd-galley/my_ocrd_workflow
This produces a workspace from the files and then runs the OCR workflow on it.