ocrd-galley
===========

<!--
[![Build Status](https://travis-ci.com/qurator-spk/ocrd-galley.svg?branch=master)](https://travis-ci.com/qurator-spk/ocrd-galley)
-->

A Dockerized test environment for OCR-D processors 🚢

WIP. Given an OCR-D workspace with document images in the OCR-D-IMG file group,
the example workflow produces:

* Binarized images
* Line segmentation
* OCR text (using Calamari and Tesseract, both with GT4HistOCR models)
* (Given ground truth in OCR-D-GT-PAGE, also an OCR text evaluation report)

If you're interested in the exact processors, versions and parameters, please
take a look at the [script](my_ocrd_workflow) and possibly the individual
Dockerfiles.

Goal
----
Provide a **test environment** to produce OCR output for historical prints,
using OCR-D, especially [ocrd_calamari](https://github.com/OCR-D/ocrd_calamari)
and
[sbb_textline_detection](https://github.com/qurator-spk/sbb_textline_detection),
including all dependencies in Docker.

How to use
----------
ocrd-galley uses Docker to run the OCR-D images. We provide pre-built container
images that get downloaded automatically when you run the provided wrappers for
the OCR-D processors.

To get the wrappers, install them into a Python venv:
~~~
cd ~/devel/ocrd-galley/wrapper
pip install .
~~~

To download models, you need to use the `-a` flag of `ocrd resmgr`:
~~~
ocrd resmgr download -a ocrd-calamari-recognize default
~~~

You may then use the script `my_ocrd_workflow` to run the containers on an
example workspace:
~~~
# Download an example workspace
cd /tmp
wget https://qurator-data.de/examples/actevedef_718448162.first-page.zip
unzip actevedef_718448162.first-page.zip

# Run the workflow on it
cd actevedef_718448162.first-page
~/devel/ocrd-galley/my_ocrd_workflow
~~~
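
Because the example workflow expects its input images in the `OCR-D-IMG` file
group, it can help to check which file groups a workspace's METS file declares
before running it. The following is a minimal sketch using only the Python
standard library; the helper name `list_file_groups` and the sample METS
fragment are made up for illustration and are not part of ocrd-galley.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"  # standard METS namespace


def list_file_groups(mets_xml):
    """Return the USE attribute of every mets:fileGrp in a METS document."""
    root = ET.fromstring(mets_xml)
    return [grp.get("USE") for grp in root.iter(f"{{{METS_NS}}}fileGrp")]


# Tiny made-up METS fragment; a real workspace keeps this in its METS file.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG"/>
    <mets:fileGrp USE="OCR-D-GT-PAGE"/>
  </mets:fileSec>
</mets:mets>"""

groups = list_file_groups(SAMPLE_METS)
print("OCR-D-IMG" in groups)
```

For a real workspace, the same function could be fed the bytes of the METS
file, e.g. `list_file_groups(pathlib.Path("mets.xml").read_bytes())`, assuming
the file is named `mets.xml`.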

### Viewing results
You may then examine the results using
[PRImA's PAGE Viewer](https://www.primaresearch.org/tools/PAGEViewer):
~~~
java -jar /path/to/JPageViewer.jar \
  --resolve-dir . \
  OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_00000024.xml
~~~
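
For a quick look at the recognized text without a GUI, the line texts can also
be pulled out of a PAGE XML file directly. This is only a sketch: it assumes
each `TextLine` carries its text in a direct `TextEquiv/Unicode` child, matches
element names without their namespace so that it does not depend on a specific
PAGE schema version, and uses a made-up sample document. The function names are
hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def line_texts(page_xml):
    """Return the text of each TextLine, read from its direct TextEquiv/Unicode."""
    texts = []
    for line in ET.fromstring(page_xml).iter():
        if local_name(line.tag) != "TextLine":
            continue
        # Only direct children are inspected, so word-level text is skipped.
        for equiv in line:
            if local_name(equiv.tag) != "TextEquiv":
                continue
            for uni in equiv:
                if local_name(uni.tag) == "Unicode":
                    texts.append(uni.text or "")
    return texts


# Made-up PAGE fragment for illustration only.
SAMPLE_PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1"><TextEquiv><Unicode>Erste</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Erste Zeile</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2">
        <TextEquiv><Unicode>Zweite Zeile</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

for text in line_texts(SAMPLE_PAGE):
    print(text)
```

If a line holds several alternative `TextEquiv` elements, this sketch returns
all of them; picking one would need extra logic.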

The example workflow also produces OCR evaluation reports using
[dinglehopper](https://github.com/qurator-spk/dinglehopper), if ground truth is
available:
~~~
firefox OCR-D-OCR-CALAMARI-EVAL/OCR-D-OCR-CALAMARI-EVAL_00000024.html
~~~

ppn2ocr
-------
The `ppn2ocr` script produces a workspace and METS file with the best images for
a given document in the digitized collection of the Berlin State Library (SBB).

Install it with an up-to-date pip (version 19.3 or newer; otherwise this will fail due to [an opencv-python-headless build failure](https://github.com/skvark/opencv-python#frequently-asked-questions)):
~~~
pip install -r ~/devel/ocrd-galley/requirements-ppn2ocr.txt
~~~
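
The linked FAQ states that pip must be at least version 19.3 to install the
pre-built wheels. A small check like the following could be run beforehand; it
is a sketch with a hypothetical helper name, and it compares only the major and
minor version numbers.

```python
import re


def pip_is_recent_enough(version, minimum=(19, 3)):
    """Return True if a pip version string such as '21.0.1' is at least `minimum`."""
    # Take the first two numeric components (major, minor) of the version string.
    numbers = tuple(int(n) for n in re.findall(r"\d+", version)[:2])
    return numbers >= minimum


print(pip_is_recent_enough("19.3"))
print(pip_is_recent_enough("9.0.1"))
```

To test the pip of the current environment, pass `pip.__version__` after
`import pip`; if the check fails, `pip install --upgrade pip` upgrades it.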

The document must be specified by its PPN, for example:
~~~
~/devel/ocrd-galley/ppn2ocr PPN77164308X
cd PPN77164308X
~/devel/ocrd-galley/my_ocrd_workflow -I MAX --skip-validation
~~~

This produces a workspace directory `PPN77164308X` with the OCR results in it;
the results are viewable as explained above.

ppn2ocr requires properly set up environment variables for the proxy
configuration. At SBB, please read `howto/docker-proxy.md` and
`howto/proxy-settings-for-shell+python.md` (in qurator's mono-repo).
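
As a rough sanity check before running `ppn2ocr` behind a proxy, one could
verify that the usual proxy variables are set. The variable names below are the
conventional ones and are an assumption here; this README does not state which
variables `ppn2ocr` actually reads, and the helper name is hypothetical.

```python
import os

# Conventional proxy variable names; an assumption, not taken from ppn2ocr.
PROXY_VARS = ("http_proxy", "https_proxy", "no_proxy")


def missing_proxy_vars(environ=os.environ):
    """Return the proxy variables that are unset or empty (in either letter case)."""
    return [
        name
        for name in PROXY_VARS
        if not (environ.get(name) or environ.get(name.upper()))
    ]


missing = missing_proxy_vars()
if missing:
    print("Proxy variables not set:", ", ".join(missing))
else:
    print("All proxy variables are set.")
```

Whether any of these are needed at all depends on the network; outside a
proxied environment an empty configuration is fine.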

ocrd-workspace-from-images
--------------------------
The `ocrd-workspace-from-images` script produces an OCR-D workspace (incl. METS)
for the given images.

~~~
~/devel/ocrd-galley/ocrd-workspace-from-images 0005.png
cd workspace-xxxxx  # output by the last command
~/devel/ocrd-galley/my_ocrd_workflow
~~~

This produces a workspace from the files and then runs the OCR workflow on it.

Build the containers yourself
-----------------------------
To build the containers yourself using Docker:
~~~
cd ~/devel/ocrd-galley/
./build
~~~