ocrd-galley
===========

[![Build Status](https://travis-ci.com/qurator-spk/ocrd-galley.svg?branch=master)](https://travis-ci.com/qurator-spk/ocrd-galley)

A Dockerized test environment for OCR-D processors 🚢

WIP. Given an OCR-D workspace with document images in the OCR-D-IMG file group,
the example workflow produces:

* Binarized images
* Line segmentation
* OCR text (using Calamari and Tesseract, both with GT4HistOCR models)
* An OCR text evaluation report, if ground truth is available in OCR-D-GT-PAGE

If you're interested in the exact processors, versions and parameters, please
take a look at the [script](my_ocrd_workflow) and possibly the individual
Dockerfiles.

Goal
----
Provide a **test environment** to produce OCR output for historical prints,
using OCR-D, especially [ocrd_calamari](https://github.com/OCR-D/ocrd_calamari)
and
[sbb_textline_detection](https://github.com/qurator-spk/sbb_textline_detection),
including all dependencies in Docker.

How to use
----------
The easiest way to use it is with the pre-built containers. To run the
containers on an example workspace:

~~~
# Update to the latest stable containers
(cd ~/devel/ocrd-galley/; ./run-docker-hub-update)

# Download an example workspace
cd /tmp
wget https://qurator-data.de/examples/actevedef_718448162.first-page.zip
unzip actevedef_718448162.first-page.zip

# Run the workflow on it
cd actevedef_718448162.first-page
~/devel/ocrd-galley/run-docker-hub
~~~
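
The workflow expects the document images in the OCR-D-IMG file group, so it can help to check a workspace before running it. The following is a minimal sketch, not part of ocrd-galley: `has_file_group` is a hypothetical helper, it assumes the METS file is named `mets.xml` (the OCR-D default), and it only does a rough text match on the `USE` attribute rather than parsing the XML.

```shell
# has_file_group WORKSPACE_DIR GROUP
# Hypothetical helper: succeeds if WORKSPACE_DIR/mets.xml exists and
# mentions a file group named GROUP (simple text match on USE="GROUP").
has_file_group() {
  [ -f "$1/mets.xml" ] && grep -q "USE=\"$2\"" "$1/mets.xml"
}
```

For example, `has_file_group . OCR-D-IMG || echo "No OCR-D-IMG file group found"` could be run inside the workspace directory before starting the workflow.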

### Build the containers yourself
To build the containers yourself using Docker:
~~~
cd ~/devel/ocrd-galley/
./build
~~~
You can then run the workflow with your self-built containers by using the
`run` script instead of `run-docker-hub` in the example above.

### Viewing results
You may then examine the results using
[PRImA's PAGE Viewer](https://www.primaresearch.org/tools/PAGEViewer):
~~~
java -jar /path/to/JPageViewer.jar \
  --resolve-dir . \
  OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_00000024.xml
~~~

The example workflow also produces OCR evaluation reports using
[dinglehopper](https://github.com/qurator-spk/dinglehopper), if ground truth is
available:
~~~
firefox OCR-D-OCR-CALAMARI-EVAL/OCR-D-OCR-CALAMARI-EVAL_00000024.html
~~~
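
For workspaces with more than one page, listing all evaluation reports at once may be handy. This is a small sketch under the assumption that every evaluation file group directory ends in `-EVAL`, as `OCR-D-OCR-CALAMARI-EVAL` does above; `list_eval_reports` is a hypothetical helper name.

```shell
# list_eval_reports [WORKSPACE_DIR]
# Hypothetical helper: prints all HTML files below directories whose
# names end in "-EVAL", sorted by path. Defaults to the current directory.
list_eval_reports() {
  find "${1:-.}" -path '*-EVAL/*' -name '*.html' | sort
}
```

Running `list_eval_reports` inside the workspace prints one report path per line, which can then be opened in a browser.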

ppn2ocr
-------
The `ppn2ocr` script produces a workspace and METS file with the best images for
a given document in the State Library Berlin (SBB)'s digitized collection.

Install it with an up-to-date pip (otherwise this will fail due to [an opencv-python-headless build failure](https://github.com/skvark/opencv-python#frequently-asked-questions)):
~~~
pip install -r ~/devel/ocrd-galley/requirements-ppn2ocr.txt
~~~
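
According to the opencv-python FAQ linked above, pip must be at least version 19.3 to install the pre-built wheels. The sketch below shows one way to check this before installing; `version_ge` and `pip_is_recent_enough` are hypothetical helpers, and the comparison relies on `sort -V` from GNU coreutils.

```shell
# version_ge A B: succeeds if version A is greater than or equal to B.
# Uses GNU sort's version ordering; the smaller version sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Succeeds if the installed pip is at least 19.3.
# "pip --version" prints a line starting with "pip <version> from ...".
pip_is_recent_enough() {
  version_ge "$(pip --version | awk '{print $2}')" 19.3
}
```

A possible use is `pip_is_recent_enough || pip install --upgrade pip`, followed by the install command above.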

The document must be specified by its PPN, for example:
~~~
~/devel/ocrd-galley/ppn2ocr PPN77164308X
cd PPN77164308X
~/devel/ocrd-galley/run-docker-hub -I BEST --skip-validation
~~~

This produces a workspace directory `PPN77164308X` with the OCR results in it;
the results are viewable as explained above.

ppn2ocr requires properly set up environment variables for the proxy
configuration. At SBB, please read `howto/docker-proxy.md` and
`howto/proxy-settings-for-shell+python.md` (in qurator's mono-repo).
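
This README does not list the exact variables involved. As a rough sketch, these are the conventional proxy variables that many command-line tools and Python HTTP libraries read; the proxy address `proxy.example.org:3128` is a made-up placeholder, and the documents mentioned above remain the authoritative source for SBB.

```shell
# Hypothetical proxy address; replace it with your site's actual proxy.
proxy_url="http://proxy.example.org:3128"

# Set both lower- and upper-case variants, since tools differ in which
# spelling they read.
export http_proxy="$proxy_url" https_proxy="$proxy_url"
export HTTP_PROXY="$proxy_url" HTTPS_PROXY="$proxy_url"

# Hosts that should be reached directly, bypassing the proxy.
export no_proxy="localhost,127.0.0.1"
```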

ocrd-workspace-from-images
--------------------------
The `ocrd-workspace-from-images` script produces an OCR-D workspace (incl. METS)
for the given images.

~~~
~/devel/ocrd-galley/ocrd-workspace-from-images 0005.png
cd workspace-xxxxx  # output by the last command
~/devel/ocrd-galley/run-docker-hub
~~~

This produces a workspace from the files and then runs the OCR workflow on it.