mirror of https://github.com/qurator-spk/ocrd-galley.git synced 2025-12-23 20:14:12 +01:00


Mike Gerber db63b46e65 📝 ppn2ocr: Mention upgrading pip Installing the ppn2ocr requirements fails when pip is not updated (see issue below), so mention the issue in the README. Quoting https://github.com/skvark/opencv-python#frequently-asked-questions: Q: Pip install fails with ModuleNotFoundError: No module named 'skbuild'? Since opencv-python version 4.3.0.*, manylinux1 wheels were replaced by manylinux2014 wheels. If your pip is too old, it will try to use the new source distribution introduced in 4.3.0.38 to manually build OpenCV because it does not know how to install manylinux2014 wheels. However, source build will also fail because of too old pip because it does not understand build dependencies in pyproject.toml. To use the new manylinux2014 pre-built wheels (or to build from source), your pip version must be >= 19.3. Please upgrade pip with pip install --upgrade pip.		2020-09-03 18:01:20 +02:00
data@0cc78464e7	🧹 Update Calamari model path	2020-08-05 12:27:05 +02:00
ocrd-bugs	🐛 ocrd-bugs: Most/All workspaces in bag files don't validate	2019-10-09 13:36:54 +02:00
.gitignore	🧹 .gitignore __pychache__/*.pyc	2020-06-18 10:51:43 +02:00
.gitmodules	✨ Run Calamari OCR	2019-08-21 11:54:01 +02:00
.travis.yml	🚧 Travis: Fix tagging/pushing/pulling the correct images	2020-08-24 18:45:30 +02:00
build	⚙️ Cache builds from previous build (for Travis)	2020-09-01 13:09:37 +02:00
Dockerfile-core	🧹 Move one-liner ocrd_logging.py to an echo statement	2020-08-24 19:35:08 +02:00
Dockerfile-dinglehopper	🎨 Rename boxed-* to my_ocrd_workflow-*	2020-08-14 17:52:57 +02:00
Dockerfile-ocrd_calamari	🎨 Rename boxed-* to my_ocrd_workflow-*	2020-08-14 17:52:57 +02:00
Dockerfile-ocrd_olena	🎨 Rename boxed-* to my_ocrd_workflow-*	2020-08-14 17:52:57 +02:00
Dockerfile-ocrd_tesserocr	🎨 Rename boxed-* to my_ocrd_workflow-*	2020-08-14 17:52:57 +02:00
Dockerfile-sbb_textline_detector	🎨 Rename boxed-* to my_ocrd_workflow-*	2020-08-14 17:52:57 +02:00
my_ocrd_workflow	⚙️ Consistently set LOG_LEVEL to INFO by default	2020-09-01 11:57:18 +02:00
ppn2ocr	🎨 ppn2ocr: Fix some whitespace code style issues	2020-09-03 17:18:42 +02:00
qurator_data_lib.sh	⬆️ Update qurator_data_lib.sh to allow not unpacking a downloaded file	2020-08-05 12:01:41 +02:00
README-DEV.md	📝 README: Mention qurator-spk and Docker Hub	2020-08-25 19:24:25 +02:00
README.md	📝 ppn2ocr: Mention upgrading pip	2020-09-03 18:01:20 +02:00
requirements-ppn2ocr.txt	🚧 Add a script that checks FULLTEXT dimensions against BEST dimensions	2020-06-18 10:49:31 +02:00
run	⚙️ Consistently set LOG_LEVEL to INFO by default	2020-09-01 11:57:18 +02:00
run-docker-hub	✨ Allow running pre-built containers again	2020-08-25 16:38:22 +02:00
zdb2ocr	🚧 zdb2ocr: Add TODOs from notes.md	2020-05-22 13:49:34 +02:00

README.md

My OCR-D workflow

WIP. Given an OCR-D workspace with document images in the OCR-D-IMG file group, this workflow produces:

Binarized images
Line segmentation
OCR text (using Calamari and Tesseract, both with GT4HistOCR models)
(Given ground truth in OCR-D-GT-PAGE, also an OCR text evaluation report)
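
Before starting a run, it can help to check that a directory has the inputs listed above. The following is a minimal sketch, not part of this repository: check_workspace is a hypothetical helper, and it assumes that the workspace keeps its METS file as mets.xml at the top level and stores each file group in a directory of the same name.

```shell
# Hypothetical helper (not part of this repository).
# Assumptions: mets.xml sits at the workspace root, and file groups
# such as OCR-D-IMG and OCR-D-GT-PAGE are stored as directories.
check_workspace() {
  ws=${1:-.}
  if [ ! -f "$ws/mets.xml" ]; then
    echo "no mets.xml in $ws"
    return 1
  fi
  if [ ! -d "$ws/OCR-D-IMG" ]; then
    echo "no OCR-D-IMG directory in $ws"
    return 1
  fi
  # Ground truth is optional; it only decides whether a report is produced.
  if [ -d "$ws/OCR-D-GT-PAGE" ]; then
    echo "ground truth found: expect an evaluation report"
  else
    echo "no ground truth: no evaluation report expected"
  fi
}
```

Calling check_workspace with a workspace path prints one line and returns a non-zero status if an input appears to be missing. A workspace whose images live in a different file group (for example BEST, as in the ppn2ocr section) would need the directory name adjusted.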

If you're interested in the exact processors, versions and parameters, please take a look at the my_ocrd_workflow script and, if needed, the individual Dockerfiles.

Goal

Provide a test environment to produce OCR output for historical prints, using OCR-D, especially ocrd_calamari and sbb_textline_detection, including all dependencies in Docker.

How to use

The easiest way to use the workflow is through the pre-built containers. To run them on an example workspace:

# Download an example workspace
cd /tmp
wget https://qurator-data.de/examples/actevedef_718448162.first-page.zip
unzip actevedef_718448162.first-page.zip

# Run the workflow on it
cd actevedef_718448162.first-page
~/devel/my_ocrd_workflow/run-docker-hub

Build the containers yourself

To build the containers yourself using Docker:

cd ~/devel/my_ocrd_workflow
./build

You may then use the run script to run your self-built containers, analogous to the example above.

Viewing results

You may then examine the results using PRImA's PAGE Viewer:

java -jar /path/to/JPageViewer.jar \
  --resolve-dir . \
  OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_00000024.xml

The workflow also produces OCR evaluation reports using dinglehopper, if ground truth was available:

firefox OCR-D-OCR-CALAMARI-EVAL/OCR-D-OCR-CALAMARI-EVAL_00000024.html
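
For workspaces with more than one page, the reports can be located in one go. This is a minimal sketch under assumptions: list_eval_reports is a hypothetical helper, and it assumes that every evaluation file group is a directory whose name ends in -EVAL, as OCR-D-OCR-CALAMARI-EVAL does above.

```shell
# Hypothetical helper (not part of this repository).
# Assumption: evaluation reports are HTML files inside directories
# whose names end in -EVAL, e.g. OCR-D-OCR-CALAMARI-EVAL.
list_eval_reports() {
  ws=${1:-.}
  # Print matching report paths in a stable, sorted order.
  find "$ws" -type f -path '*-EVAL/*' -name '*.html' | sort
}
```

The printed paths can then be opened in a browser one at a time, as in the firefox call above.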

ppn2ocr

The ppn2ocr script produces a METS file with the best images for a given document in the digitized collection of the State Library Berlin (SBB).

Install its requirements with an up-to-date pip (version 19.3 or newer according to the opencv-python FAQ; with an older pip the opencv-python-headless build fails):

pip install -r ~/devel/my_ocrd_workflow/requirements-ppn2ocr.txt
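
Whether the installed pip is recent enough can be checked before installing. The sketch below is not part of this repository: version_at_least is a hypothetical helper, the 19.3 threshold is the one quoted from the opencv-python FAQ, and the comparison assumes a sort that supports -V (as GNU coreutils does). The snippet only prints a hint and does not change the system.

```shell
# Hypothetical helper (not part of this repository).
# Succeeds if version $1 is at least version $2.
# Assumption: sort supports -V (version sort), as in GNU coreutils.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# "pip --version" prints e.g. "pip 20.2.2 from ...": take the second field.
pip_version=$(pip --version 2>/dev/null | awk '{print $2}')
if [ -z "$pip_version" ]; then
  echo "pip not found"
elif version_at_least "$pip_version" 19.3; then
  echo "pip $pip_version is recent enough"
else
  echo "pip $pip_version is too old; run: pip install --upgrade pip"
fi
```

Keeping the upgrade itself as a separate manual step avoids changing an environment unintentionally.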

The document must be specified by its PPN, for example:

~/devel/my_ocrd_workflow/ppn2ocr PPN77164308X
cd PPN77164308X
~/devel/my_ocrd_workflow/run-docker-hub -I BEST --skip-validation

This produces a workspace directory PPN77164308X with the OCR results in it; the results are viewable as explained above.

ppn2ocr requires a working Docker setup and properly set up environment variables for the proxy configuration. At SBB, please read howto/docker-proxy.md and howto/proxy-settings-for-shell+python.md (in qurator's mono-repo).
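
A quick way to see whether the proxy variables are present in the current shell is sketched below. It is an illustration under assumptions: missing_proxy_vars is a hypothetical helper, and the lowercase names http_proxy, https_proxy and no_proxy are the commonly used convention, not names confirmed by this README. The howto documents mentioned above remain the authoritative source.

```shell
# Hypothetical helper (not part of this repository).
# Assumption: the proxy is configured through the conventional
# lowercase variables http_proxy, https_proxy and no_proxy.
missing_proxy_vars() {
  for name in http_proxy https_proxy no_proxy; do
    # Look up the variable named by $name; treat unset as empty.
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}
```

The function prints the name of each variable that is unset or empty, one per line, and prints nothing when all three are set.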