diff --git a/README.md b/README.md index 1adc3d7..4683eb7 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,6 @@ # Eynollah -> Document Layout Analysis with Deep Learning and Heuristics + +> Document Layout Analysis, Binarization and OCR with Deep Learning and Heuristics [![PyPI Version](https://img.shields.io/pypi/v/eynollah)](https://pypi.org/project/eynollah/) [![GH Actions Test](https://github.com/qurator-spk/eynollah/actions/workflows/test-eynollah.yml/badge.svg)](https://github.com/qurator-spk/eynollah/actions/workflows/test-eynollah.yml) @@ -23,6 +24,7 @@ historical documents and therefore processing can be very slow. We aim to improve this, but contributions are welcome. ## Installation + Python `3.8-3.11` with Tensorflow `<2.13` on Linux are currently supported. For (limited) GPU support the CUDA toolkit needs to be installed. @@ -42,19 +44,30 @@ cd eynollah; pip install -e . Alternatively, you can run `make install` or `make install-dev` for editable installation. +To also install the dependencies for the OCR engines: + +``` +pip install "eynollah[OCR]" +# or +make install EXTRAS=OCR +``` + ## Models -Pretrained models can be downloaded from [qurator-data.de](https://qurator-data.de/eynollah/) or [huggingface](https://huggingface.co/SBB?search_models=eynollah). +Pretrained models can be downloaded from [zenodo](https://zenodo.org/records/17194824) or [huggingface](https://huggingface.co/SBB?search_models=eynollah). For documentation on methods and models, have a look at [`models.md`](https://github.com/qurator-spk/eynollah/tree/main/docs/models.md). ## Train + In case you want to train your own model with Eynollah, have a look at [`train.md`](https://github.com/qurator-spk/eynollah/tree/main/docs/train.md). ## Usage -Eynollah supports four use cases: layout analysis (segmentation), binarization, text recognition (OCR), -and (trainable) reading order detection. 
+ +Eynollah supports five use cases: layout analysis (segmentation), binarization, +image enhancement, text recognition (OCR), and (trainable) reading order detection. ### Layout Analysis + The layout analysis module is responsible for detecting layouts, identifying text lines, and determining reading order using both heuristic methods or a machine-based reading order detection model. @@ -97,58 +110,54 @@ and marginals). The best output quality is produced when RGB images are used as input rather than greyscale or binarized images. ### Binarization + The binarization module performs document image binarization using pretrained pixelwise segmentation models. The command-line interface for binarization of single image can be called like this: ```sh eynollah binarization \ + -i | -di \ + -o \ -m \ - \ - -``` - -and for flowing from a directory like this: - -```sh -eynollah binarization \ - -m \ - -di \ - -do ``` ### OCR + The OCR module performs text recognition from images using two main families of pretrained models: CNN-RNN–based OCR and Transformer-based OCR. The command-line interface for ocr can be called like this: ```sh eynollah ocr \ - -m | --model_name \ -i | -di \ -dx \ - -o + -o \ + -m | --model_name \ ``` ### Machine-based-reading-order + The machine-based reading-order module employs a pretrained model to identify the reading order from layouts represented in PAGE-XML files. The command-line interface for machine based reading order can be called like this: ```sh eynollah machine-based-reading-order \ - -m \ + -i | -di \ -xml | -dx \ + -m \ -o ``` #### Use as OCR-D processor + Eynollah ships with a CLI interface to be used as [OCR-D](https://ocr-d.de) [processor](https://ocr-d.de/en/spec/cli), formally described in [`ocrd-tool.json`](https://github.com/qurator-spk/eynollah/tree/main/src/eynollah/ocrd-tool.json). 
In this case, the source image file group with (preferably) RGB images should be used as input like this: - ocrd-eynollah-segment -I OCR-D-IMG -O OCR-D-SEG -P models 2022-04-05 + ocrd-eynollah-segment -I OCR-D-IMG -O OCR-D-SEG -P models eynollah_layout_v0_5_0 If the input file group is PAGE-XML (from a previous OCR-D workflow step), Eynollah behaves as follows: - existing regions are kept and ignored (i.e. in effect they might overlap segments from Eynollah results) @@ -160,14 +169,20 @@ If the input file group is PAGE-XML (from a previous OCR-D workflow step), Eynol (because some other preprocessing step was in effect like `denoised`), then the output PAGE-XML will be based on that as new top-level (`@imageFilename`) - ocrd-eynollah-segment -I OCR-D-XYZ -O OCR-D-SEG -P models 2022-04-05 + ocrd-eynollah-segment -I OCR-D-XYZ -O OCR-D-SEG -P models eynollah_layout_v0_5_0 Still, in general, it makes more sense to add other workflow steps **after** Eynollah. +There is also an OCR-D processor for the binarization: + + ocrd-sbb-binarize -I OCR-D-IMG -O OCR-D-BIN -P models default-2021-03-09 + #### Additional documentation + Please check the [wiki](https://github.com/qurator-spk/eynollah/wiki). ## How to cite + If you find this tool useful in your work, please consider citing our paper: ```bibtex diff --git a/docs/models.md b/docs/models.md index ac563b0..3d296d5 100644 --- a/docs/models.md +++ b/docs/models.md @@ -1,5 +1,6 @@ # Models documentation -This suite of 14 models presents a document layout analysis (DLA) system for historical documents implemented by + +This suite of 15 models presents a document layout analysis (DLA) system for historical documents implemented by pixel-wise segmentation using a combination of a ResNet50 encoder with various U-Net decoders. In addition, heuristic methods are applied to detect marginals and to determine the reading order of text regions. 
@@ -23,6 +24,7 @@ See the flowchart below for the different stages and how they interact: ## Models ### Image enhancement + Model card: [Image Enhancement](https://huggingface.co/SBB/eynollah-enhancement) This model addresses image resolution, specifically targeting documents with suboptimal resolution. In instances where @@ -30,12 +32,14 @@ the detection of document layout exhibits inadequate performance, the proposed e the quality and clarity of the images, thus facilitating enhanced visual interpretation and analysis. ### Page extraction / border detection + Model card: [Page Extraction/Border Detection](https://huggingface.co/SBB/eynollah-page-extraction) A problem that can negatively affect OCR are black margins around a page caused by document scanning. A deep learning model helps to crop to the page borders by using a pixel-wise segmentation method. ### Column classification + Model card: [Column Classification](https://huggingface.co/SBB/eynollah-column-classifier) This model is a trained classifier that recognizes the number of columns in a document by use of a training set with @@ -43,6 +47,7 @@ manual classification of all documents into six classes with either one, two, th respectively. ### Binarization + Model card: [Binarization](https://huggingface.co/SBB/eynollah-binarization) This model is designed to tackle the intricate task of document image binarization, which involves segmentation of the @@ -52,6 +57,7 @@ capability of the model enables improved accuracy and reliability in subsequent enhanced document understanding and interpretation. ### Main region detection + Model card: [Main Region Detection](https://huggingface.co/SBB/eynollah-main-regions) This model has employed a different set of labels, including an artificial class specifically designed to encompass the @@ -61,6 +67,7 @@ during the inference phase. By incorporating this methodology, improved efficien model's ability to accurately identify and classify text regions within documents. 
### Main region detection (with scaling augmentation) + Model card: [Main Region Detection (with scaling augmentation)](https://huggingface.co/SBB/eynollah-main-regions-aug-scaling) Utilizing scaling augmentation, this model leverages the capability to effectively segment elements of extremely high or @@ -69,12 +76,14 @@ categorizing and isolating such elements, thereby enhancing its overall performa documents with varying scale characteristics. ### Main region detection (with rotation augmentation) + Model card: [Main Region Detection (with rotation augmentation)](https://huggingface.co/SBB/eynollah-main-regions-aug-rotation) This model takes advantage of rotation augmentation. This helps the tool to segment the vertical text regions in a robust way. ### Main region detection (ensembled) + Model card: [Main Region Detection (ensembled)](https://huggingface.co/SBB/eynollah-main-regions-ensembled) The robustness of this model is attained through an ensembling technique that combines the weights from various epochs. @@ -82,16 +91,19 @@ By employing this approach, the model achieves a high level of resilience and st strengths of multiple epochs to enhance its overall performance and deliver consistent and reliable results. ### Full region detection (1,2-column documents) + Model card: [Full Region Detection (1,2-column documents)](https://huggingface.co/SBB/eynollah-full-regions-1column) This model deals with documents comprising of one and two columns. ### Full region detection (3,n-column documents) + Model card: [Full Region Detection (3,n-column documents)](https://huggingface.co/SBB/eynollah-full-regions-3pluscolumn) This model is responsible for detecting headers and drop capitals in documents with three or more columns. ### Textline detection + Model card: [Textline Detection](https://huggingface.co/SBB/eynollah-textline) The method for textline detection combines deep learning and heuristics. 
In the deep learning part, an image-to-image @@ -106,6 +118,7 @@ segmentation is first deskewed and then the textlines are separated with the sam textline bounding boxes. Later, the strap is rotated back into its original orientation. ### Textline detection (light) + Model card: [Textline Detection Light (simpler but faster method)](https://huggingface.co/SBB/eynollah-textline_light) The method for textline detection combines deep learning and heuristics. In the deep learning part, an image-to-image @@ -119,6 +132,7 @@ enhancing the model's ability to accurately identify and delineate individual te eliminates the need for additional heuristics in extracting textline contours. ### Table detection + Model card: [Table Detection](https://huggingface.co/SBB/eynollah-tables) The objective of this model is to perform table segmentation in historical document images. Due to the pixel-wise @@ -128,17 +142,21 @@ effectively identify and delineate tables within the historical document images, enabling subsequent analysis and interpretation. ### Image detection + Model card: [Image Detection](https://huggingface.co/SBB/eynollah-image-extraction) This model is used for the task of illustration detection only. ### Reading order detection + Model card: [Reading Order Detection]() TODO ## Heuristic methods + Additionally, some heuristic methods are employed to further improve the model predictions: + * After border detection, the largest contour is determined by a bounding box, and the image cropped to these coordinates. * For text region detection, the image is scaled up to make it easier for the model to detect background space between text regions. * A minimum area is defined for text regions in relation to the overall image dimensions, so that very small regions that are noise can be filtered out. 
diff --git a/docs/train.md b/docs/train.md index 9f44a63..47ad67b 100644 --- a/docs/train.md +++ b/docs/train.md @@ -1,4 +1,5 @@ # Training documentation + This aims to assist users in preparing training datasets, training models, and performing inference with trained models. We cover various use cases including pixel-wise segmentation, image classification, image enhancement, and machine-based reading order detection. For each use case, we provide guidance on how to generate the corresponding training dataset. @@ -11,6 +12,7 @@ The following three tasks can all be accomplished using the code in the * inference with the trained model ## Generate training dataset + The script `generate_gt_for_training.py` is used for generating training datasets. As the results of the following command demonstrates, the dataset generator provides three different commands: @@ -23,14 +25,19 @@ These three commands are: * pagexml2label ### image-enhancement + Generating a training dataset for image enhancement is quite straightforward. All that is needed is a set of high-resolution images. The training dataset can then be generated using the following command: -`python generate_gt_for_training.py image-enhancement -dis "dir of high resolution images" -dois "dir where degraded -images will be written" -dols "dir where the corresponding high resolution image will be written as label" -scs -"degrading scales json file"` +```sh +python generate_gt_for_training.py image-enhancement \ + -dis "dir of high resolution images" \ + -dois "dir where degraded images will be written" \ + -dols "dir where the corresponding high resolution image will be written as label" \ + -scs "degrading scales json file" +``` -The scales JSON file is a dictionary with a key named 'scales' and values representing scales smaller than 1. Images are +The scales JSON file is a dictionary with a key named `scales` and values representing scales smaller than 1. 
Images are downscaled based on these scales and then upscaled again to their original size. This process causes the images to lose resolution at different scales. The degraded images are used as input images, and the original high-resolution images serve as labels. The enhancement model can be trained with this generated dataset. The scales JSON file looks like this: @@ -42,6 +49,7 @@ serve as labels. The enhancement model can be trained with this generated datase ``` ### machine-based-reading-order + For machine-based reading order, we aim to determine the reading priority between two sets of text regions. The model's input is a three-channel image: the first and last channels contain information about each of the two text regions, while the middle channel encodes prominent layout elements necessary for reading order, such as separators and headers. @@ -52,10 +60,18 @@ For output images, it is necessary to specify the width and height. Additionally to filter out regions smaller than this minimum size. This minimum size is defined as the ratio of the text region area to the image area, with a default value of zero. To run the dataset generator, use the following command: -`python generate_gt_for_training.py machine-based-reading-order -dx "dir of GT xml files" -domi "dir where output images -will be written" -docl "dir where the labels will be written" -ih "height" -iw "width" -min "min area ratio"` +```shell +python generate_gt_for_training.py machine-based-reading-order \ + -dx "dir of GT xml files" \ + -domi "dir where output images will be written" \ + -docl "dir where the labels will be written" \ + -ih "height" \ + -iw "width" \ + -min "min area ratio" +``` ### pagexml2label + pagexml2label is designed to generate labels from GT page XML files for various pixel-wise segmentation use cases, including 'layout,' 'textline,' 'printspace,' 'glyph,' and 'word' segmentation. 
To train a pixel-wise segmentation model, we require images along with their corresponding labels. Our training script @@ -119,9 +135,13 @@ graphic region, "stamp" has its own class, while all other types are classified region" are also present in the label. However, other regions like "noise region" and "table region" will not be included in the label PNG file, even if they have information in the page XML files, as we chose not to include them. -`python generate_gt_for_training.py pagexml2label -dx "dir of GT xml files" -do "dir where output label png files will -be written" -cfg "custom config json file" -to "output type which has 2d and 3d. 2d is used for training and 3d is just -to visualise the labels" "` +```sh +python generate_gt_for_training.py pagexml2label \ + -dx "dir of GT xml files" \ + -do "dir where output label png files will be written" \ + -cfg "custom config json file" \ + -to "output type which has 2d and 3d. 2d is used for training and 3d is just to visualise the labels" +``` We have also defined an artificial class that can be added to the boundary of text region types or text lines. This key is called "artificial_class_on_boundary." If users want to apply this to certain text regions in the layout use case, @@ -169,12 +189,19 @@ in this scenario, since cropping will be applied to the label files, the directo provided to ensure that they are cropped in sync with the labels. This ensures that the correct images and labels required for training are obtained. The command should resemble the following: -`python generate_gt_for_training.py pagexml2label -dx "dir of GT xml files" -do "dir where output label png files will -be written" -cfg "custom config json file" -to "output type which has 2d and 3d. 
2d is used for training and 3d is just -to visualise the labels" -ps -di "dir where the org images are located" -doi "dir where the cropped output images will -be written" ` +```sh +python generate_gt_for_training.py pagexml2label \ + -dx "dir of GT xml files" \ + -do "dir where output label png files will be written" \ + -cfg "custom config json file" \ + -to "output type which has 2d and 3d. 2d is used for training and 3d is just to visualise the labels" \ + -ps \ + -di "dir where the org images are located" \ + -doi "dir where the cropped output images will be written" +``` ## Train a model + ### classification For the classification use case, we haven't provided a ground truth generator, as it's unnecessary. For classification, @@ -225,7 +252,9 @@ And the "dir_eval" the same structure as train directory: The classification model can be trained using the following command line: -`python train.py with config_classification.json` +```sh +python train.py with config_classification.json +``` As evident in the example JSON file above, for classification, we utilize a "f1_threshold_classification" parameter. This parameter is employed to gather all models with an evaluation f1 score surpassing this threshold. Subsequently, @@ -276,6 +305,7 @@ The classification model can be trained like the classification case command lin ### Segmentation (Textline, Binarization, Page extraction and layout) and enhancement #### Parameter configuration for segmentation or enhancement usecases + The following parameter configuration can be applied to all segmentation use cases and enhancements. The augmentation, its sub-parameters, and continued training are defined only for segmentation use cases and enhancements, not for classification and machine-based reading order, as you can see in their example config files. 
@@ -355,6 +385,7 @@ command, similar to the process for classification and reading order: `python train.py with config_classification.json` #### Binarization + An example config json file for binarization can be like this: ```yaml @@ -550,6 +581,7 @@ For page segmentation (or printspace or border segmentation), the model needs to hence the patches parameter should be set to false. #### layout segmentation + An example config json file for layout segmentation with 5 classes (including background) can be like this: ```yaml @@ -605,26 +637,41 @@ An example config json file for layout segmentation with 5 classes (including ba ## Inference with the trained model ### classification + For conducting inference with a trained model, you simply need to execute the following command line, specifying the directory of the model and the image on which to perform inference: -`python inference.py -m "model dir" -i "image" ` +```sh +python inference.py -m "model dir" -i "image" +``` This will straightforwardly return the class of the image. ### machine based reading order + To infer the reading order using a reading order model, we need a page XML file containing layout information but without the reading order. We simply need to provide the model directory, the XML file, and the output directory. The new XML file with the added reading order will be written to the output directory with the same name. 
We need to run: -`python inference.py -m "model dir" -xml "page xml file" -o "output dir to write new xml with reading order" ` +```sh +python inference.py \ + -m "model dir" \ + -xml "page xml file" \ + -o "output dir to write new xml with reading order" +``` ### Segmentation (Textline, Binarization, Page extraction and layout) and enhancement For conducting inference with a trained model for segmentation and enhancement you need to run the following command line: -`python inference.py -m "model dir" -i "image" -p -s "output image" ` +```sh +python inference.py \ + -m "model dir" \ + -i "image" \ + -p \ + -s "output image" +``` Note that in the case of page extraction the -p flag is not needed. diff --git a/tests/test_run.py b/tests/test_run.py index da0455a..be928a0 100644 --- a/tests/test_run.py +++ b/tests/test_run.py @@ -289,27 +289,26 @@ def test_run_eynollah_ocr_filename(tmp_path, subtests, pytestconfig, caplog): assert len(out_texts) >= 2, ("result is inaccurate", out_texts) assert sum(map(len, out_texts)) > 100, ("result is inaccurate", out_texts) -# kba Fri Sep 26 12:53:49 CEST 2025 -# Disabled until NHWC/NCHW error in https://github.com/qurator-spk/eynollah/actions/runs/18019655200/job/51273541895 debugged -# def test_run_eynollah_ocr_directory(tmp_path, subtests, pytestconfig, caplog): -# indir = testdir.joinpath('resources') -# outdir = tmp_path -# args = [ -# '-m', MODELS_OCR, -# '-di', str(indir), -# '-dx', str(indir), -# '-o', str(outdir), -# ] -# if pytestconfig.getoption('verbose') > 0: -# args.extend(['-l', 'DEBUG']) -# caplog.set_level(logging.INFO) -# def only_eynollah(logrec): -# return logrec.name == 'eynollah' -# runner = CliRunner() -# with caplog.filtering(only_eynollah): -# result = runner.invoke(ocr_cli, args, catch_exceptions=False) -# assert result.exit_code == 0, result.stdout -# logmsgs = [logrec.message for logrec in caplog.records] -# # FIXME: ocr has no logging! 
-#     #assert any(True for logmsg in logmsgs if logmsg.startswith('???')), logmsgs
-#     assert len(list(outdir.iterdir())) == 2
+@pytest.mark.skip("Disabled until NHWC/NCHW error in https://github.com/qurator-spk/eynollah/actions/runs/18019655200/job/51273541895 debugged")
+def test_run_eynollah_ocr_directory(tmp_path, subtests, pytestconfig, caplog):
+    indir = testdir.joinpath('resources')
+    outdir = tmp_path
+    args = [
+        '-m', MODELS_OCR,
+        '-di', str(indir),
+        '-dx', str(indir),
+        '-o', str(outdir),
+    ]
+    if pytestconfig.getoption('verbose') > 0:
+        args.extend(['-l', 'DEBUG'])
+    caplog.set_level(logging.INFO)
+    def only_eynollah(logrec):
+        return logrec.name == 'eynollah'
+    runner = CliRunner()
+    with caplog.filtering(only_eynollah):
+        result = runner.invoke(ocr_cli, args, catch_exceptions=False)
+    assert result.exit_code == 0, result.stdout
+    logmsgs = [logrec.message for logrec in caplog.records]
+    # FIXME: ocr has no logging!
+    #assert any(True for logmsg in logmsgs if logmsg.startswith('???')), logmsgs
+    assert len(list(outdir.iterdir())) == 2