mirror of https://github.com/qurator-spk/ocrd-galley.git synced 2025-07-27 13:49:53 +02:00

Gerber, Mike 691be243f6 ✨ Use MAX file group name instead of BEST We were using the file group name BEST for what Kitodo seems to call MAX by convention. So we use MAX now. Currently, we work under the assumption that, if MAX exists in the METS retrieved by OAI-PMH, it's not what we want and we replace it with our own IIIF URLS with full size. Fixes GH-43.		2021-02-18 16:34:25 +01:00
data@3fbdbcf368	🚧 Add a. our augmented GT4HistOCR Calamari model b. chreul's GT4HistOCR model	2020-11-30 17:52:24 +01:00
wrapper	✨ Add support for ocrd_wrap aka ocrd-skimage-*	2021-02-16 18:42:08 +01:00
.drone.star	✨ Add support for ocrd_wrap aka ocrd-skimage-*	2021-02-16 18:42:08 +01:00
.gitignore	🚧 Add a wrapper script to call containers	2020-11-17 16:39:47 +01:00
.gitmodules	✨ Run Calamari OCR	2019-08-21 11:54:01 +02:00
build	⚙️ Also use quratorspk for the manual build	2021-02-15 17:10:27 +01:00
build-tmp-XXX	🚧 Try out Drone CI	2021-02-11 16:46:31 +01:00
Dockerfile-core	🧹 Don't use pip option --use-feature=2020-resolver anymore	2020-12-03 18:45:56 +01:00
Dockerfile-core-cuda10.0	🚧 Support CUDA	2021-01-15 20:22:37 +01:00
Dockerfile-core-cuda10.1	🚧 Support CUDA	2021-01-15 20:22:37 +01:00
Dockerfile-dinglehopper	🐳 Upload images to quratorspk/ocrd-galley*	2021-02-15 14:09:15 +01:00
Dockerfile-ocrd_calamari	🐛 Update ocrd_calamari to 1.0.2, to fix word coordinates	2021-02-16 12:21:06 +01:00
Dockerfile-ocrd_calamari03	🐳 Upload images to quratorspk/ocrd-galley*	2021-02-15 14:09:15 +01:00
Dockerfile-ocrd_cis	🐳 Upload images to quratorspk/ocrd-galley*	2021-02-15 14:09:15 +01:00
Dockerfile-ocrd_fileformat	Revert "👷🏾‍♂️ Use ocrd_fileformat#28 for now"	2021-02-15 16:29:58 +01:00
Dockerfile-ocrd_olena	🐳 Upload images to quratorspk/ocrd-galley*	2021-02-15 14:09:15 +01:00
Dockerfile-ocrd_segment	🐳 Upload images to quratorspk/ocrd-galley*	2021-02-15 14:09:15 +01:00
Dockerfile-ocrd_tesserocr	🐳 Upload images to quratorspk/ocrd-galley*	2021-02-15 14:09:15 +01:00
Dockerfile-ocrd_wrap	✨ Add support for ocrd_wrap aka ocrd-skimage-*	2021-02-16 18:42:08 +01:00
Dockerfile-sbb_binarization	🐳 Upload images to quratorspk/ocrd-galley*	2021-02-15 14:09:15 +01:00
Dockerfile-sbb_textline_detector	🐳 Upload images to quratorspk/ocrd-galley*	2021-02-15 14:09:15 +01:00
LICENSE	⚖️ Add a LICENSE file	2021-02-11 16:21:28 +01:00
my_ocrd_workflow	⬆️ Update ocrd_fileformat + use in default workflow to convert to ALTO	2021-02-04 17:36:55 +01:00
my_ocrd_workflow-sbb	⚙️ my_ocrd_workflow-sbb: Only convert Calamari OCR	2021-02-15 16:15:59 +01:00
ocrd-workspace-from-images	✨ ocrd-workspace-from-images	2020-10-09 16:50:12 +02:00
ppn2ocr	✨ Use MAX file group name instead of BEST	2021-02-18 16:34:25 +01:00
qurator_data_lib.sh	🚧 Try out Drone CI	2021-02-11 16:48:52 +01:00
README-DEV.md	✨ Add support for ocrd_wrap aka ocrd-skimage-*	2021-02-16 18:42:08 +01:00
README.md	✨ Use MAX file group name instead of BEST	2021-02-18 16:34:25 +01:00
requirements-ppn2ocr.txt	🚧 Add a script that checks FULLTEXT dimensions against BEST dimensions	2020-06-18 10:49:31 +02:00
zdb2ocr	🚧 zdb2ocr: Add TODOs from notes.md	2020-05-22 13:49:34 +02:00

README.md

ocrd-galley

A Dockerized test environment for OCR-D processors 🚢

WIP. Given an OCR-D workspace with document images in the OCR-D-IMG file group, the example workflow produces:

Binarized images
Line segmentation
OCR text (using Calamari and Tesseract, both with GT4HistOCR models)
(Given ground truth in OCR-D-GT-PAGE, also an OCR text evaluation report)

If you're interested in the exact processors, versions and parameters, please take a look at the my_ocrd_workflow script and, if needed, the individual Dockerfiles.

Goal

Provide a test environment to produce OCR output for historical prints, using OCR-D, especially ocrd_calamari and sbb_textline_detection, including all dependencies in Docker.

How to use

Currently, due to problems with Travis CI, we do not provide pre-built containers anymore.

To build the containers yourself using Docker:

cd ~/devel/ocrd-galley/
./build

You can then install the wrappers into a Python venv:

cd ~/devel/ocrd-galley/wrapper
pip install .

You may then use the script my_ocrd_workflow to use your self-built containers on an example workspace:

# Download an example workspace
cd /tmp
wget https://qurator-data.de/examples/actevedef_718448162.first-page.zip
unzip actevedef_718448162.first-page.zip

# Run the workflow on it
cd actevedef_718448162.first-page
~/devel/ocrd-galley/my_ocrd_workflow

Viewing results

You may then examine the results using PRImA's PAGE Viewer:

java -jar /path/to/JPageViewer.jar \
  --resolve-dir . \
  OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_00000024.xml

The example workflow also produces OCR evaluation reports using dinglehopper, if ground truth was available:

firefox OCR-D-OCR-CALAMARI-EVAL/OCR-D-OCR-CALAMARI-EVAL_00000024.html
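
For workspaces with more than one page or OCR engine, it can be handy to list every evaluation report at once. The following is a minimal sketch, assuming that all evaluation file groups follow the OCR-D-OCR-*-EVAL naming seen above and store their reports as .html files; find_eval_reports is a hypothetical helper name, not part of ocrd-galley:

```shell
# Hypothetical helper: print the path of every HTML evaluation report in a workspace.
# Assumption: evaluation file groups are directories named OCR-D-OCR-<ENGINE>-EVAL.
find_eval_reports() {
    workspace="${1:-.}"
    for report in "$workspace"/OCR-D-OCR-*-EVAL/*.html; do
        # Skip the unexpanded glob pattern when nothing matches
        [ -f "$report" ] || continue
        printf '%s\n' "$report"
    done
}

# Usage (from inside a workspace):
#   find_eval_reports .
```

The output is one path per line, so it can be passed on to a browser or a further script.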

ppn2ocr

The ppn2ocr script produces a workspace and METS file with the best images for a given document in the Berlin State Library (SBB)'s digitized collection.

Install its requirements with an up-to-date pip (otherwise this will fail due to an opencv-python-headless build failure):

pip install -r ~/devel/ocrd-galley/requirements-ppn2ocr.txt

The document must be specified by its PPN, for example:

~/devel/ocrd-galley/ppn2ocr PPN77164308X
cd PPN77164308X
~/devel/ocrd-galley/my_ocrd_workflow -I MAX --skip-validation

This produces a workspace directory PPN77164308X with the OCR results in it; the results are viewable as explained above.
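
To process several documents in one go, the three commands above can be wrapped in a loop. This is only a sketch, assuming the checkout lives in ~/devel/ocrd-galley as in the examples; ocr_ppns and GALLEY_DIR are hypothetical names introduced here, not part of ocrd-galley:

```shell
# Hypothetical helper: run ppn2ocr and the example workflow for each given PPN.
# GALLEY_DIR is an assumed variable pointing at the ocrd-galley checkout.
GALLEY_DIR="${GALLEY_DIR:-$HOME/devel/ocrd-galley}"

ocr_ppns() {
    for ppn in "$@"; do
        # Run in a subshell so that the cd does not leak into the caller
        (
            "$GALLEY_DIR/ppn2ocr" "$ppn" &&
            cd "$ppn" &&
            "$GALLEY_DIR/my_ocrd_workflow" -I MAX --skip-validation
        ) || echo "Failed: $ppn" >&2
    done
}

# Usage:
#   ocr_ppns PPN77164308X
```

A failure for one PPN is reported on stderr and does not stop the remaining documents from being processed.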

ppn2ocr requires properly set up environment variables for the proxy configuration. At SBB, please read howto/docker-proxy.md and howto/proxy-settings-for-shell+python.md (in qurator's mono-repo).
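
This README does not list which variables ppn2ocr needs. As a rough sketch, assuming the conventional http_proxy and https_proxy variables are the relevant ones, a quick check before running the script could look like this (check_proxy_env is a hypothetical helper name):

```shell
# Hypothetical helper: report which of the conventional proxy variables are unset.
# Assumption: http_proxy and https_proxy are the variables that matter here.
check_proxy_env() {
    missing=0
    for var in http_proxy https_proxy; do
        # Indirect lookup; var only ever takes the fixed names listed above
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "$var is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_proxy_env && ~/devel/ocrd-galley/ppn2ocr PPN77164308X
```

The function only checks that the variables are non-empty; it does not verify that the proxy itself is reachable.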

ocrd-workspace-from-images

The ocrd-workspace-from-images script produces an OCR-D workspace (incl. METS) for the given images.

~/devel/ocrd-galley/ocrd-workspace-from-images 0005.png
cd workspace-xxxxx  # output by the last command
~/devel/ocrd-galley/my_ocrd_workflow

This produces a workspace from the files and then runs the OCR workflow on it.
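
When scripting these steps, the generated workspace-xxxxx name has to be picked up somehow. One possible approach, assuming the new workspace is created in the current directory and is the most recently modified directory matching workspace-*, is sketched below; latest_workspace is a hypothetical helper name:

```shell
# Hypothetical helper: print the most recently modified workspace-* directory.
# Assumption: ocrd-workspace-from-images creates the workspace in the current directory.
latest_workspace() {
    # ls -t sorts by modification time (newest first), -d lists the directories themselves
    ls -td workspace-*/ 2>/dev/null | head -n 1 | sed 's:/$::'
}

# Usage:
#   ~/devel/ocrd-galley/ocrd-workspace-from-images 0005.png
#   cd "$(latest_workspace)"
#   ~/devel/ocrd-galley/my_ocrd_workflow
```

This relies on directory names without newlines; if nothing matches, the function prints nothing.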