Gerber, Mike 691be243f6
Use MAX file group name instead of BEST
We were using the file group name BEST for what Kitodo seems to call
MAX by convention. So we use MAX now.

Currently, we work under the assumption that, if MAX exists in the METS
retrieved by OAI-PMH, it's not what we want and we replace it with our
own full-size IIIF URLs.

Fixes GH-43.
2021-02-18 16:34:25 +01:00

ocrd-galley

A Dockerized test environment for OCR-D processors 🚢

WIP. Given an OCR-D workspace with document images in the OCR-D-IMG file group, the example workflow produces:

  • Binarized images
  • Line segmentation
  • OCR text (using Calamari and Tesseract, both with GT4HistOCR models)
  • (Given ground truth in OCR-D-GT-PAGE, also an OCR text evaluation report)

If you're interested in the exact processors, versions and parameters, please take a look at the my_ocrd_workflow script and possibly the individual Dockerfiles.

Goal

Provide a test environment to produce OCR output for historical prints, using OCR-D, especially ocrd_calamari and sbb_textline_detection, including all dependencies in Docker.

How to use

Currently, due to problems with Travis CI, we no longer provide pre-built containers.

To build the containers yourself using Docker:

cd ~/devel/ocrd-galley/
./build

You can then install the wrappers into a Python venv:

cd ~/devel/ocrd-galley/wrapper
pip install .

You may then use the my_ocrd_workflow script to run your self-built containers on an example workspace:

# Download an example workspace
cd /tmp
wget https://qurator-data.de/examples/actevedef_718448162.first-page.zip
unzip actevedef_718448162.first-page.zip

# Run the workflow on it
cd actevedef_718448162.first-page
~/devel/ocrd-galley/my_ocrd_workflow
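
Before running the workflow it can help to confirm that the current directory really is a usable workspace. The following is only a sketch: check_workspace is a hypothetical helper, not part of ocrd-galley, and it assumes that the images of a file group are stored in a local directory named after that group (a common convention, but the METS file may also reference images elsewhere).

```shell
#!/bin/sh
# Hypothetical helper (not part of ocrd-galley): check that a directory
# looks like an OCR-D workspace the example workflow can start from.
# Assumption: the input file group's images live in a directory named
# after the group, e.g. OCR-D-IMG/.
check_workspace() {
    ws=${1:-.}              # workspace directory, defaults to the current one
    group=${2:-OCR-D-IMG}   # input file group, defaults to OCR-D-IMG

    if [ ! -f "$ws/mets.xml" ]; then
        echo "missing METS file: $ws/mets.xml" >&2
        return 1
    fi
    if [ ! -d "$ws/$group" ]; then
        echo "missing file group directory: $ws/$group" >&2
        return 1
    fi
    return 0
}
```

For example, `check_workspace . && ~/devel/ocrd-galley/my_ocrd_workflow` would only start the workflow if both checks pass.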

Viewing results

You may then examine the results using PRImA's PAGE Viewer:

java -jar /path/to/JPageViewer.jar \
  --resolve-dir . \
  OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_00000024.xml

The example workflow also produces OCR evaluation reports using dinglehopper, if ground truth is available:

firefox OCR-D-OCR-CALAMARI-EVAL/OCR-D-OCR-CALAMARI-EVAL_00000024.html
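
The two paths shown above follow a regular pattern, so the report for a given PAGE file can be derived from its path. This is a sketch under the assumption that all file groups are named like the example (the report group is the OCR group plus -EVAL, with the same page suffix); eval_report_for is a hypothetical helper and returns a path relative to the workspace root.

```shell
#!/bin/sh
# Hypothetical helper: map a PAGE XML path to the matching evaluation
# report path, assuming the naming seen in the example above:
#   GROUP/GROUP_NNN.xml  ->  GROUP-EVAL/GROUP-EVAL_NNN.html
eval_report_for() {
    page=$1
    group=$(basename "$(dirname "$page")")   # e.g. OCR-D-OCR-CALAMARI
    file=$(basename "$page" .xml)            # e.g. OCR-D-OCR-CALAMARI_00000024
    suffix=${file#"$group"}                  # e.g. _00000024
    printf '%s-EVAL/%s-EVAL%s.html\n' "$group" "$group" "$suffix"
}
```

Usage, from within the workspace directory: `firefox "$(eval_report_for OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_00000024.xml)"`.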

ppn2ocr

The ppn2ocr script produces a workspace and METS file with the best images for a given document in the Berlin State Library (SBB)'s digitized collection.

Install its requirements with an up-to-date pip (otherwise this will fail due to an opencv-python-headless build failure):

pip install -r ~/devel/ocrd-galley/requirements-ppn2ocr.txt

The document must be specified by its PPN, for example:

~/devel/ocrd-galley/ppn2ocr PPN77164308X
cd PPN77164308X
~/devel/ocrd-galley/my_ocrd_workflow -I MAX --skip-validation

This produces a workspace directory PPN77164308X with the OCR results in it; the results are viewable as explained above.
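
When processing several documents, a quick format check can catch mistyped identifiers before anything is downloaded. The sketch below assumes that PPNs look like the example above, i.e. the prefix PPN followed by digits and optionally a trailing X; is_ppn is a hypothetical helper and this pattern is an assumption, not a rule taken from ppn2ocr.

```shell
#!/bin/sh
# Hypothetical helper: rough format check for a PPN argument.
# Assumption: "PPN" + one or more digits + optional trailing "X",
# as in the example PPN77164308X.
is_ppn() {
    case $1 in
        PPN?*) rest=${1#PPN} ;;   # strip the prefix
        *) return 1 ;;
    esac
    rest=${rest%X}                # allow one trailing X
    case $rest in
        ''|*[!0-9]*) return 1 ;;  # empty, or contains a non-digit
    esac
    return 0
}

# Possible use in a batch loop:
#   for ppn in PPN77164308X ...; do
#       is_ppn "$ppn" && ~/devel/ocrd-galley/ppn2ocr "$ppn"
#   done
```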

ppn2ocr requires properly set up environment variables for the proxy configuration. At SBB, please read howto/docker-proxy.md and howto/proxy-settings-for-shell+python.md (in qurator's mono-repo).
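
The text above does not name the variables, so the following sketch assumes the conventional lowercase http_proxy, https_proxy and no_proxy that many command-line tools and Python libraries read. missing_proxy_vars is a hypothetical helper that reports which of them are unset; the proxy address in the comment is a placeholder, not a real SBB setting.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the conventional proxy
# variables that are unset or empty (assumed variable names, see above).
missing_proxy_vars() {
    missing=
    for var in http_proxy https_proxy no_proxy; do
        eval "value=\${$var:-}"   # indirect lookup of the variable named in $var
        [ -n "$value" ] || missing="$missing $var"
    done
    printf '%s\n' "${missing# }"
}

# Example settings (placeholder host and port):
#   export http_proxy=http://proxy.example.org:3128
#   export https_proxy=http://proxy.example.org:3128
#   export no_proxy=localhost,127.0.0.1
```

An empty result from `missing_proxy_vars` means all three variables are set.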

ocrd-workspace-from-images

The ocrd-workspace-from-images script produces an OCR-D workspace (incl. METS) for the given images.

~/devel/ocrd-galley/ocrd-workspace-from-images 0005.png
cd workspace-xxxxx  # output by the last command
~/devel/ocrd-galley/my_ocrd_workflow

This produces a workspace from the files and then runs the OCR workflow on it.