Gerber, Mike 68f6e1609b Allow (re)building only some container images
To speed up rebuilding container images, you can now supply the desired
subimage to build:

  ./build sbb_textline_detector

or even simpler, leveraging shell filename completion:

  ./build Dockerfile-sbb_textline_detector
4 years ago

README.md

ocrd-galley


A Dockerized test environment for OCR-D processors 🚢

WIP. Given an OCR-D workspace with document images in the OCR-D-IMG file group, the example workflow produces:

  • Binarized images
  • Line segmentation
  • OCR text (using Calamari and Tesseract, both with GT4HistOCR models)
  • (Given ground truth in OCR-D-GT-PAGE, also an OCR text evaluation report)

If you're interested in the exact processors, versions and parameters, please take a look at the my_ocrd_workflow script and, if needed, the individual Dockerfiles.

Goal

Provide a test environment to produce OCR output for historical prints using OCR-D, in particular ocrd_calamari and sbb_textline_detector, with all dependencies packaged in Docker.

How to use

Currently, due to problems with Travis CI, we no longer provide pre-built containers.

To build the containers yourself using Docker:

cd ~/devel/ocrd-galley/
./build

You can then install the wrappers into a Python venv:

cd ~/devel/ocrd-galley/wrapper
pip install .

You may then use the my_ocrd_workflow script to run your self-built containers on an example workspace:

# Download an example workspace
cd /tmp
wget https://qurator-data.de/examples/actevedef_718448162.first-page.zip
unzip actevedef_718448162.first-page.zip

# Run the workflow on it
cd actevedef_718448162.first-page
~/devel/ocrd-galley/my_ocrd_workflow

Viewing results

You may then examine the results using PRImA's PAGE Viewer:

java -jar /path/to/JPageViewer.jar \
  --resolve-dir . \
  OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_00000024.xml

If ground truth is available, the example workflow also produces OCR evaluation reports using dinglehopper:

firefox OCR-D-OCR-CALAMARI-EVAL/OCR-D-OCR-CALAMARI-EVAL_00000024.html
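
For workspaces with more than one page, it may help to list all evaluation reports first. The following is a minimal sketch, assuming the reports are HTML files inside the OCR-D-OCR-CALAMARI-EVAL file group directory as in the example above; the helper name list_eval_reports is hypothetical and not part of ocrd-galley.

```shell
# Hypothetical helper (not part of ocrd-galley): print every HTML report
# found in an evaluation file group directory, one path per line.
list_eval_reports() {
    # Default to the directory name used in the example above.
    dir="${1:-OCR-D-OCR-CALAMARI-EVAL}"
    for report in "$dir"/*.html; do
        # Skip the unexpanded pattern when no report exists
        # (for example, when no ground truth was available).
        [ -e "$report" ] || continue
        printf '%s\n' "$report"
    done
}
```

Called without arguments from inside the workspace, it prints the paths of the reports, which can then be opened in a browser one by one.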

ppn2ocr

The ppn2ocr script produces a workspace and METS file with the best images for a given document in the digitized collection of the State Library Berlin (SBB).

Install it with an up-to-date pip (otherwise this will fail due to an opencv-python-headless build failure):

pip install -r ~/devel/ocrd-galley/requirements-ppn2ocr.txt

The document must be specified by its PPN, for example:

~/devel/ocrd-galley/ppn2ocr PPN77164308X
cd PPN77164308X
~/devel/ocrd-galley/my_ocrd_workflow -I BEST --skip-validation

This produces a workspace directory PPN77164308X with the OCR results in it; the results are viewable as explained above.
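
To process several documents in a row, the two steps can be wrapped in a loop. This is only a sketch under assumptions: the function name ocr_ppns and the GALLEY variable are hypothetical, and the checkout location ~/devel/ocrd-galley is taken from the examples in this README.

```shell
# Location of the ocrd-galley checkout, as used throughout this README.
# Override via the environment if your checkout lives elsewhere.
GALLEY="${GALLEY:-$HOME/devel/ocrd-galley}"

# Hypothetical helper: fetch and OCR each PPN given as an argument.
ocr_ppns() {
    for ppn in "$@"; do
        # ppn2ocr creates a workspace directory named after the PPN.
        "$GALLEY/ppn2ocr" "$ppn" || return 1
        # Run the workflow in a subshell so the caller's working
        # directory stays unchanged.
        ( cd "$ppn" && "$GALLEY/my_ocrd_workflow" -I BEST --skip-validation ) || return 1
    done
}
```

For example, `ocr_ppns PPN77164308X` would do the same as the three commands above; the function stops at the first document that fails.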

ppn2ocr requires properly set up environment variables for the proxy configuration. At SBB, please read howto/docker-proxy.md and howto/proxy-settings-for-shell+python.md (in qurator's mono-repo).
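
A quick check before running ppn2ocr can catch a missing proxy setup early. The sketch below assumes the conventional http_proxy, https_proxy and no_proxy variables (in lower or upper case); which variables your environment actually needs is described in the documents mentioned above, and the helper name check_proxy_env is hypothetical.

```shell
# Hypothetical helper: warn about conventional proxy variables that are
# unset. The variable names are an assumption, not taken from ppn2ocr.
check_proxy_env() {
    missing=0
    for var in http_proxy https_proxy no_proxy; do
        upper=$(printf '%s' "$var" | tr '[:lower:]' '[:upper:]')
        # Look up the lower-case name first, then the upper-case one.
        eval "value=\${$var:-\${$upper:-}}"
        if [ -z "$value" ]; then
            echo "warning: neither $var nor $upper is set" >&2
            missing=1
        fi
    done
    # Return 0 if all three are set, 1 otherwise.
    return "$missing"
}
```

It could be used as `check_proxy_env && ~/devel/ocrd-galley/ppn2ocr PPN77164308X`.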

ocrd-workspace-from-images

The ocrd-workspace-from-images script produces an OCR-D workspace (incl. METS) for the given images.

~/devel/ocrd-galley/ocrd-workspace-from-images 0005.png
cd workspace-xxxxx  # output by the last command
~/devel/ocrd-galley/my_ocrd_workflow

This produces a workspace from the files and then runs the OCR workflow on it.