Gerber, Mike 691be243f6
Use MAX file group name instead of BEST
We were using the file group name BEST for what Kitodo seems to call
MAX by convention. So we use MAX now.

Currently, we work under the assumption that, if MAX exists in the METS
retrieved by OAI-PMH, it's not what we want and we replace it with our
own IIIF URLS with full size.

Fixes GH-43.
4 years ago
data@3fbdbcf368 🚧 Add a. our augmented GT4HistOCR Calamari model b. chreul's GT4HistOCR model 4 years ago
wrapper Add support for ocrd_wrap aka ocrd-skimage-* 4 years ago
.drone.star Add support for ocrd_wrap aka ocrd-skimage-* 4 years ago
.gitignore 🚧 Add a wrapper script to call containers 4 years ago
.gitmodules Run Calamari OCR 5 years ago
Dockerfile-core 🧹 Don't use pip option --use-feature=2020-resolver anymore 4 years ago
Dockerfile-core-cuda10.0 🚧 Support CUDA 4 years ago
Dockerfile-core-cuda10.1 🚧 Support CUDA 4 years ago
Dockerfile-dinglehopper 🐳 Upload images to quratorspk/ocrd-galley* 4 years ago
Dockerfile-ocrd_calamari 🐛 Update ocrd_calamari to 1.0.2, to fix word coordinates 4 years ago
Dockerfile-ocrd_calamari03 🐳 Upload images to quratorspk/ocrd-galley* 4 years ago
Dockerfile-ocrd_cis 🐳 Upload images to quratorspk/ocrd-galley* 4 years ago
Dockerfile-ocrd_fileformat Revert "👷🏾‍♂️ Use ocrd_fileformat#28 for now" 4 years ago
Dockerfile-ocrd_olena 🐳 Upload images to quratorspk/ocrd-galley* 4 years ago
Dockerfile-ocrd_segment 🐳 Upload images to quratorspk/ocrd-galley* 4 years ago
Dockerfile-ocrd_tesserocr 🐳 Upload images to quratorspk/ocrd-galley* 4 years ago
Dockerfile-ocrd_wrap Add support for ocrd_wrap aka ocrd-skimage-* 4 years ago
Dockerfile-sbb_binarization 🐳 Upload images to quratorspk/ocrd-galley* 4 years ago
Dockerfile-sbb_textline_detector 🐳 Upload images to quratorspk/ocrd-galley* 4 years ago
LICENSE ⚖️ Add a LICENSE file 4 years ago
README-DEV.md Add support for ocrd_wrap aka ocrd-skimage-* 4 years ago
README.md Use MAX file group name instead of BEST 4 years ago
build ⚙️ Also use quratorspk for the manual build 4 years ago
build-tmp-XXX 🚧 Try out Drone CI 4 years ago
my_ocrd_workflow ⬆️ Update ocrd_fileformat + use in default workflow to convert to ALTO 4 years ago
my_ocrd_workflow-sbb ⚙️ my_ocrd_workflow-sbb: Only convert Calamari OCR 4 years ago
ocrd-workspace-from-images ocrd-workspace-from-images 4 years ago
ppn2ocr Use MAX file group name instead of BEST 4 years ago
qurator_data_lib.sh 🚧 Try out Drone CI 4 years ago
requirements-ppn2ocr.txt 🚧 Add a script that checks FULLTEXT dimensions against BEST dimensions 5 years ago
zdb2ocr 🚧 zdb2ocr: Add TODOs from notes.md 5 years ago

README.md

ocrd-galley

A Dockerized test environment for OCR-D processors 🚢

WIP. Given an OCR-D workspace with document images in the OCR-D-IMG file group, the example workflow produces:

  • Binarized images
  • Line segmentation
  • OCR text (using Calamari and Tesseract, both with GT4HistOCR models)
  • (Given ground truth in OCR-D-GT-PAGE, also an OCR text evaluation report)

If you're interested in the exact processors, versions and parameters, please take a look at the script and possibly the individual Dockerfiles.

Goal

Provide a test environment to produce OCR output for historical prints, using OCR-D, especially ocrd_calamari and sbb_textline_detection, including all dependencies in Docker.

How to use

Currently, due to problems with Travis CI, we no longer provide pre-built containers.

To build the containers yourself using Docker:

cd ~/devel/ocrd-galley/
./build

You can then install the wrappers into a Python venv:

cd ~/devel/ocrd-galley/wrapper
pip install .

You may then use the script my_ocrd_workflow to use your self-built containers on an example workspace:

# Download an example workspace
cd /tmp
wget https://qurator-data.de/examples/actevedef_718448162.first-page.zip
unzip actevedef_718448162.first-page.zip

# Run the workflow on it
cd actevedef_718448162.first-page
~/devel/ocrd-galley/my_ocrd_workflow
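
Since the workflow expects an OCR-D workspace with images in the OCR-D-IMG file group, a quick sanity check before running it can save a failed run. The following is a minimal sketch, not part of ocrd-galley: `check_workspace` is a hypothetical helper, and it assumes the METS file is named `mets.xml` and that the file group's images sit in a directory of the same name (the usual convention; the authoritative declaration is the METS file itself).

```shell
# Hypothetical helper (not shipped with ocrd-galley): checks that a directory
# looks like an OCR-D workspace with input images.
# Assumptions: METS file is called mets.xml; the OCR-D-IMG file group is
# stored in a directory of the same name.
check_workspace() {
  ws="${1:-.}"
  if [ ! -f "$ws/mets.xml" ]; then
    echo "No mets.xml found in $ws" >&2
    return 1
  fi
  if [ ! -d "$ws/OCR-D-IMG" ]; then
    echo "No OCR-D-IMG directory found in $ws" >&2
    return 1
  fi
  echo "$ws looks like a workspace with an OCR-D-IMG file group"
}

# Usage (from inside the unzipped example workspace):
#   check_workspace . && ~/devel/ocrd-galley/my_ocrd_workflow
```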

Viewing results

You may then examine the results using PRImA's PAGE Viewer:

java -jar /path/to/JPageViewer.jar \
  --resolve-dir . \
  OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_00000024.xml

The example workflow also produces OCR evaluation reports using dinglehopper, if ground truth was available:

firefox OCR-D-OCR-CALAMARI-EVAL/OCR-D-OCR-CALAMARI-EVAL_00000024.html
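
To get an overview of what a run produced before opening individual files, one option is to count the files per output directory. This is only a sketch: `summarize_groups` is a hypothetical helper, and it assumes that each file group is stored in a directory whose name starts with `OCR-D-`, as in the paths shown above.

```shell
# Hypothetical helper (not shipped with ocrd-galley): prints one line per
# OCR-D-* directory in the workspace, in the form "<directory> <file count>".
# Assumption: file groups are stored in directories named OCR-D-*.
summarize_groups() {
  ws="${1:-.}"
  for dir in "$ws"/OCR-D-*/; do
    [ -d "$dir" ] || continue          # skip if the glob matched nothing
    name=$(basename "$dir")
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    echo "$name $count"
  done
}

# Usage (from inside the workspace):
#   summarize_groups .
```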

ppn2ocr

The ppn2ocr script produces a workspace and METS file with the best images for a given document in the Berlin State Library (SBB)'s digitized collection.

Install it with an up-to-date pip (otherwise this will fail due to an opencv-python-headless build failure):

pip install -r ~/devel/ocrd-galley/requirements-ppn2ocr.txt

The document must be specified by its PPN, for example:

~/devel/ocrd-galley/ppn2ocr PPN77164308X
cd PPN77164308X
~/devel/ocrd-galley/my_ocrd_workflow -I MAX --skip-validation

This produces a workspace directory PPN77164308X with the OCR results in it; the results are viewable as explained above.
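
When several documents need processing, the three commands above can be wrapped in a small function and called in a loop. The sketch below is an illustration only: `process_ppn` and the `GALLEY_DIR` variable are hypothetical names, and the default path simply mirrors the `~/devel/ocrd-galley` checkout used throughout this README.

```shell
# Hypothetical wrapper (not shipped with ocrd-galley) around the commands above.
# GALLEY_DIR is an assumed variable pointing at the ocrd-galley checkout.
GALLEY_DIR="${GALLEY_DIR:-$HOME/devel/ocrd-galley}"

process_ppn() {
  ppn="$1"
  # Fetch the workspace for this PPN, then run the workflow inside it.
  # The subshell keeps the caller's working directory unchanged.
  "$GALLEY_DIR/ppn2ocr" "$ppn" &&
    (cd "$ppn" && "$GALLEY_DIR/my_ocrd_workflow" -I MAX --skip-validation)
}

# Usage:
#   for ppn in PPN77164308X; do process_ppn "$ppn" || echo "failed: $ppn" >&2; done
```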

ppn2ocr requires properly set up environment variables for the proxy configuration. At SBB, please read howto/docker-proxy.md and howto/proxy-settings-for-shell+python.md (in qurator's mono-repo).
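
Outside SBB, or as a rough idea of what such a setup looks like, proxy settings are commonly passed through the conventional `http_proxy`, `https_proxy` and `no_proxy` environment variables, which tools such as wget, curl and Python's HTTP libraries generally honor. The sketch below assumes that these are the variables in question; the proxy address is a placeholder, not a real SBB value, and the howto documents mentioned above remain authoritative.

```shell
# Placeholder proxy address (an assumption for illustration); replace it with
# the proxy of your own network.
PROXY_URL="http://proxy.example.org:3128"

# Some tools read the lowercase names, others the uppercase ones, so set both.
export http_proxy="$PROXY_URL"
export https_proxy="$PROXY_URL"
export HTTP_PROXY="$PROXY_URL"
export HTTPS_PROXY="$PROXY_URL"

# Hosts that should be reached directly, without the proxy.
export no_proxy="localhost,127.0.0.1"
export NO_PROXY="$no_proxy"
```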

ocrd-workspace-from-images

The ocrd-workspace-from-images script produces an OCR-D workspace (incl. METS) for the given images.

~/devel/ocrd-galley/ocrd-workspace-from-images 0005.png
cd workspace-xxxxx  # output by the last command
~/devel/ocrd-galley/my_ocrd_workflow

This produces a workspace from the files and then runs the OCR workflow on it.