ocrd-workspace-from-images

pull/38/head
Gerber, Mike 4 years ago
parent 9d42de5da4
commit a6695f2927

@ -69,8 +69,8 @@ firefox OCR-D-OCR-CALAMARI-EVAL/OCR-D-OCR-CALAMARI-EVAL_00000024.html
 ppn2ocr
 -------
-The `ppn2ocr` script produces a METS file with the best images for a given
-document in the State Library Berlin (SBB)'s digitized collection.
+The `ppn2ocr` script produces a workspace and METS file with the best images for
+a given document in the State Library Berlin (SBB)'s digitized collection.

 Install it with an up-to-date pip (otherwise this will fail due to [an opencv-python-headless build failure](https://github.com/skvark/opencv-python#frequently-asked-questions)):

 ~~~
@ -87,7 +87,19 @@ cd PPN77164308X
 This produces a workspace directory `PPN77164308X` with the OCR results in it;
 the results are viewable as explained above.

-ppn2ocr requires a working Docker setup and properly set up environment
-variables for the proxy configuration. At SBB, please read
-`howto/docker-proxy.md` and `howto/proxy-settings-for-shell+python.md`
-(in qurator's mono-repo).
+ppn2ocr requires properly set up environment variables for the proxy
+configuration. At SBB, please read `howto/docker-proxy.md` and
+`howto/proxy-settings-for-shell+python.md` (in qurator's mono-repo).
+
+ocrd-workspace-from-images
+--------------------------
+The `ocrd-workspace-from-images` script produces an OCR-D workspace (incl. METS)
+for the given images.
+
+~~~
+~/devel/my_ocrd_workflow/ocrd-workspace-from-images 0005.png
+cd workspace-sj4EH
+~/devel/my_ocrd_workflow/run-docker-hub -I BEST --skip-validation
+~~~
+
+This produces a workspace from the files and then runs the OCR workflow on it.

@ -0,0 +1,38 @@
#!/bin/bash
# Create an OCR-D workspace from images
#
#   ocrd-workspace-from-images *.png
#
# In order to produce a workspace that validates, this script makes best effort
# to generate random IDs and to create the necessary structures like the
# physical page sequence.

workspace_dir=$(mktemp -d "workspace-XXXXX")
workspace_id=$(basename "$workspace_dir")

ocrd workspace -d "$workspace_dir" init
ocrd workspace -d "$workspace_dir" set-id "$workspace_id"

# Derive a file ID from a file name: strip the image extension, then replace
# every character outside [A-Za-z0-9_-] with an underscore.
make_file_id_from_filename() {
  filename="$1"
  file_id="$filename"
  # -E enables the (...|...) alternation and "?"; the "i" flag is a GNU sed extension
  file_id=$(echo "$file_id" | sed -E 's#\.(png|tif|jpe?g)$##i')
  file_id=$(echo "$file_id" | sed 's#[^A-Za-z0-9_-]#_#g')
  echo "$file_id"
}

mkdir "$workspace_dir/OCR-D-IMG"

page_count=0
for img_orig in "$@"; do
  page_count=$((page_count + 1))
  # Quote the file name so that paths containing spaces survive
  img="$workspace_dir/OCR-D-IMG/$(basename "$img_orig")"
  cp -L "$img_orig" "$img"
  file_id=$(make_file_id_from_filename "$img")
  mime_type=$(file -b --mime-type "$img")
  page_id=$(printf "P%05d" "$page_count")
  ocrd workspace -d "$workspace_dir" add -G OCR-D-IMG "$img" --file-id "$file_id" --page-id "$page_id" --mimetype "$mime_type"
done

ocrd workspace -d "$workspace_dir" validate

echo "$workspace_dir"
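
The ID scheme used by the script can be tried without an OCR-D installation. The following is a minimal dry-run sketch, not part of the commit: it only prints the file ID and page ID that each image would get. The function name `make_file_id` and the sample file names are hypothetical, and GNU sed is assumed (for `-E` together with the case-insensitive `i` flag).

```shell
#!/bin/bash
# Dry run: show the file ID and page ID per image without calling ocrd.
# Assumption: GNU sed (the "i" flag of the s command is a GNU extension).

# Hypothetical stand-alone variant of the script's ID helper.
make_file_id() {
  local file_id="$1"
  # Strip a trailing image extension, ignoring case.
  file_id=$(printf '%s\n' "$file_id" | sed -E 's#\.(png|tif|jpe?g)$##i')
  # Replace every character outside [A-Za-z0-9_-] with an underscore.
  file_id=$(printf '%s\n' "$file_id" | sed 's#[^A-Za-z0-9_-]#_#g')
  printf '%s\n' "$file_id"
}

page_count=0
# Sample file names, made up for illustration.
for img in "page 0005.PNG" "0006.jpeg"; do
  page_count=$((page_count + 1))
  # Page IDs are "P" plus the zero-padded position in the argument list.
  page_id=$(printf 'P%05d' "$page_count")
  printf '%s %s\n' "$(make_file_id "$img")" "$page_id"
done
```

Run as is, this prints `page_0005 P00001` and `0006 P00002`. In the script itself the helper receives the path of the copied file, so the workspace and `OCR-D-IMG` directory names become part of the file ID as well.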