Gerber, Mike 1a532b1ccc Validate PPN argument
ppn2ocr expects the PPN to be in the PPNxxxxxxx format, i.e. including
the leading 'PPN' string. Validate the argument accordingly.
4 years ago
data@0cc78464e7 🧹 Update Calamari model path 4 years ago
ocrd-bugs 🐛 ocrd-bugs: Most/All workspaces in bag files don't validate 5 years ago
.gitignore 🧹 .gitignore __pychache__/*.pyc 5 years ago
.gitmodules Run Calamari OCR 5 years ago
.travis.yml 🚧 Travis: Fix tagging/pushing/pulling the correct images 4 years ago
Dockerfile-core 🧹 Move one-liner ocrd_logging.py to an echo statement 4 years ago
Dockerfile-dinglehopper 🎨 Rename boxed-* to my_ocrd_workflow-* 4 years ago
Dockerfile-ocrd_calamari 🎨 Rename boxed-* to my_ocrd_workflow-* 4 years ago
Dockerfile-ocrd_olena 🎨 Rename boxed-* to my_ocrd_workflow-* 4 years ago
Dockerfile-ocrd_tesserocr 🎨 Rename boxed-* to my_ocrd_workflow-* 4 years ago
Dockerfile-sbb_textline_detector 🎨 Rename boxed-* to my_ocrd_workflow-* 4 years ago
README-DEV.md 📝 README: Mention qurator-spk and Docker Hub 4 years ago
README.md 📝 README: Fix SBB documentation comment 4 years ago
build ⚙️ Cache builds from previous build (for Travis) 4 years ago
my_ocrd_workflow ⚙️ Consistently set LOG_LEVEL to INFO by default 4 years ago
ppn2ocr Validate PPN argument 4 years ago
qurator_data_lib.sh ⬆️ Update qurator_data_lib.sh to allow not unpacking a downloaded file 4 years ago
requirements-ppn2ocr.txt 🚧 Add a script that checks FULLTEXT dimensions against BEST dimensions 5 years ago
run ⚙️ Consistently set LOG_LEVEL to INFO by default 4 years ago
run-docker-hub Allow running pre-built containers again 4 years ago
zdb2ocr 🚧 zdb2ocr: Add TODOs from notes.md 5 years ago

README.md

My OCR-D workflow

WIP. Given an OCR-D workspace with document images in the OCR-D-IMG file group, this workflow produces:

  • Binarized images
  • Line segmentation
  • OCR text (using Calamari and Tesseract, both with GT4HistOCR models)
  • An OCR text evaluation report (only if ground truth is provided in OCR-D-GT-PAGE)
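
Before running the workflow, it can help to confirm that the workspace really contains the OCR-D-IMG file group. The following is a minimal sketch, assuming the workspace's METS file declares its file groups as mets:fileGrp elements with a USE attribute; the helper names file_groups and has_input_images are hypothetical and not part of this repository.

```python
import xml.etree.ElementTree as ET
from typing import List

# Standard METS namespace
METS_NS = {"mets": "http://www.loc.gov/METS/"}


def file_groups(mets_xml: str) -> List[str]:
    """Return the USE attribute of every mets:fileGrp in a METS document."""
    root = ET.fromstring(mets_xml)
    return [grp.get("USE") for grp in root.iterfind(".//mets:fileGrp", METS_NS)]


def has_input_images(mets_xml: str, group: str = "OCR-D-IMG") -> bool:
    """Check whether the given file group (hypothetical helper) is declared."""
    return group in file_groups(mets_xml)


# Tiny in-memory METS document, for illustration only
SAMPLE = (
    '<mets:mets xmlns:mets="http://www.loc.gov/METS/">'
    "<mets:fileSec>"
    '<mets:fileGrp USE="OCR-D-IMG"/>'
    '<mets:fileGrp USE="OCR-D-GT-PAGE"/>'
    "</mets:fileSec>"
    "</mets:mets>"
)

print(file_groups(SAMPLE))
print(has_input_images(SAMPLE))
```

For a real workspace, the content of its mets.xml would be passed in instead of SAMPLE.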

If you're interested in the exact processors, versions and parameters, please take a look at the my_ocrd_workflow script and possibly the individual Dockerfiles.

Goal

Provide a test environment to produce OCR output for historical prints, using OCR-D, especially ocrd_calamari and sbb_textline_detection, including all dependencies in Docker.

How to use

The easiest way to use the workflow is through the pre-built containers. To run them on an example workspace:

# Download an example workspace
cd /tmp
wget https://qurator-data.de/examples/actevedef_718448162.first-page.zip
unzip actevedef_718448162.first-page.zip

# Run the workflow on it
cd actevedef_718448162.first-page
~/devel/my_ocrd_workflow/run-docker-hub

Build the containers yourself

To build the containers yourself using Docker:

cd ~/devel/my_ocrd_workflow
./build

You can then run your self-built containers with the run script, in the same way as run-docker-hub is used in the example above.

Viewing results

You may then examine the results using PRImA's PAGE Viewer:

java -jar /path/to/JPageViewer.jar \
  --resolve-dir . \
  OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_00000024.xml

If ground truth is available, the workflow also produces OCR evaluation reports using dinglehopper:

firefox OCR-D-OCR-CALAMARI-EVAL/OCR-D-OCR-CALAMARI-EVAL_00000024.html
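
To get an overview of all pages rather than opening them one by one, the result files can be paired with their evaluation reports. This is a minimal sketch, assuming the output files follow the naming seen in the two examples above (GROUP_NNNNNNNN.xml and GROUP-EVAL_NNNNNNNN.html); list_results is a hypothetical helper, not part of this repository.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def list_results(
    workspace: Path, ocr_group: str = "OCR-D-OCR-CALAMARI"
) -> List[Tuple[Path, Optional[Path]]]:
    """Pair each PAGE XML file with its evaluation report, if one exists."""
    eval_group = ocr_group + "-EVAL"
    ocr_dir = workspace / ocr_group
    if not ocr_dir.is_dir():
        return []

    results = []
    for page in sorted(ocr_dir.glob("*.xml")):
        # Skip files that do not follow the assumed GROUP_NNNNNNNN.xml naming
        if not page.stem.startswith(ocr_group):
            continue
        suffix = page.stem[len(ocr_group):]  # e.g. "_00000024"
        report = workspace / eval_group / f"{eval_group}{suffix}.html"
        results.append((page, report if report.exists() else None))
    return results


# Usage: run from inside a workspace directory
for page, report in list_results(Path(".")):
    print(page, "->", report if report else "no evaluation report")
```

Pages without ground truth simply show up without a report.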

ppn2ocr

The ppn2ocr script produces a METS file with the best images for a given document in the digitized collection of the State Library Berlin (SBB). The document must be specified by its PPN in the PPNxxxxxxx format, i.e. including the leading 'PPN' string, for example:

pip install -r ~/devel/my_ocrd_workflow/requirements-ppn2ocr.txt
~/devel/my_ocrd_workflow/ppn2ocr PPN77164308X
cd PPN77164308X
~/devel/my_ocrd_workflow/run-docker-hub -I BEST --skip-validation

This produces a workspace directory PPN77164308X with the OCR results in it; the results are viewable as explained above.
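
ppn2ocr expects its argument to include the leading 'PPN' string. The check below is a minimal sketch of such a validation, not the script's actual code. The exact pattern is an assumption: 'PPN' followed by digits, optionally ending in the check character 'X' as in the example above. is_valid_ppn is a hypothetical name.

```python
import re

# Assumption: "PPN", then one or more digits, then an optional trailing "X"
PPN_PATTERN = re.compile(r"PPN[0-9]+X?")


def is_valid_ppn(arg: str) -> bool:
    """Return True if arg looks like a PPN including the 'PPN' prefix."""
    return PPN_PATTERN.fullmatch(arg) is not None


for candidate in ("PPN77164308X", "77164308X"):
    print(candidate, is_valid_ppn(candidate))
```

Under this assumed pattern, a bare number without the prefix is rejected, which matches the documented requirement.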

ppn2ocr requires a working Docker setup and properly set up environment variables for the proxy configuration. At SBB, please read howto/docker-proxy.md and howto/proxy-settings-for-shell+python.md (in qurator's mono-repo).
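
A quick way to see which proxy settings are visible to a Python process is to inspect the environment. This sketch assumes the conventional variable names http_proxy, https_proxy and no_proxy (in lower or upper case); the names that SBB's howto documents actually prescribe may differ, and proxy_settings is a hypothetical helper.

```python
import os
from typing import Dict, Mapping

# Conventional proxy variable names (an assumption, see the howto documents)
PROXY_VARS = ("http_proxy", "https_proxy", "no_proxy")


def proxy_settings(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the proxy variables that are set, keyed by lower-case name."""
    found = {}
    for name in PROXY_VARS:
        value = environ.get(name) or environ.get(name.upper())
        if value:
            found[name] = value
    return found


print(proxy_settings())
```

An empty result means none of these variables is set in the current shell.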