📝 update README

commit 3123add815 (parent 830cc2c30a): 4 changed files with 141 additions and 62 deletions

# Models documentation

This suite of 15 models presents a document layout analysis (DLA) system for historical documents implemented by
pixel-wise segmentation using a combination of a ResNet50 encoder with various U-Net decoders. In addition, heuristic
methods are applied to detect marginals and to determine the reading order of text regions.

See the flowchart below for the different stages and how they interact:

## Models

### Image enhancement

Model card: [Image Enhancement](https://huggingface.co/SBB/eynollah-enhancement)

This model addresses image resolution, specifically targeting documents with suboptimal resolution. In instances where
the detection of document layout exhibits inadequate performance, the proposed enhancement model can improve
the quality and clarity of the images, thus facilitating enhanced visual interpretation and analysis.

### Page extraction / border detection

Model card: [Page Extraction/Border Detection](https://huggingface.co/SBB/eynollah-page-extraction)

A problem that can negatively affect OCR is black margins around a page caused by document scanning. A deep learning
model helps to crop to the page borders by using a pixel-wise segmentation method.

### Column classification

Model card: [Column Classification](https://huggingface.co/SBB/eynollah-column-classifier)

This model is a trained classifier that recognizes the number of columns in a document by use of a training set with
manual classification of all documents into six classes with either one, two, three, four, five, or six columns,
respectively.

### Binarization

Model card: [Binarization](https://huggingface.co/SBB/eynollah-binarization)

This model is designed to tackle the intricate task of document image binarization, which involves segmentation of the
image into white and black pixels. The robust
capability of the model enables improved accuracy and reliability in subsequent processing stages, facilitating
enhanced document understanding and interpretation.

### Main region detection

Model card: [Main Region Detection](https://huggingface.co/SBB/eynollah-main-regions)

This model has employed a different set of labels, including an artificial class specifically designed to encompass the
boundaries of text regions. This artificial class helps the model to better separate adjacent text regions
during the inference phase. By incorporating this methodology, improved efficiency is achieved, contributing to the
model's ability to accurately identify and classify text regions within documents.

### Main region detection (with scaling augmentation)

Model card: [Main Region Detection (with scaling augmentation)](https://huggingface.co/SBB/eynollah-main-regions-aug-scaling)

Utilizing scaling augmentation, this model leverages the capability to effectively segment elements of extremely high or
low scale. The model has been trained to address the challenges of
categorizing and isolating such elements, thereby enhancing its overall performance when processing
documents with varying scale characteristics.

### Main region detection (with rotation augmentation)

Model card: [Main Region Detection (with rotation augmentation)](https://huggingface.co/SBB/eynollah-main-regions-aug-rotation)

This model takes advantage of rotation augmentation. This helps the tool to segment the vertical text regions in a
robust way.

### Main region detection (ensembled)

Model card: [Main Region Detection (ensembled)](https://huggingface.co/SBB/eynollah-main-regions-ensembled)

The robustness of this model is attained through an ensembling technique that combines the weights from various epochs.
By employing this approach, the model achieves a high level of resilience and stability, combining the
strengths of multiple epochs to enhance its overall performance and deliver consistent and reliable results.

### Full region detection (1,2-column documents)

Model card: [Full Region Detection (1,2-column documents)](https://huggingface.co/SBB/eynollah-full-regions-1column)

This model deals with documents comprising one or two columns.

### Full region detection (3,n-column documents)

Model card: [Full Region Detection (3,n-column documents)](https://huggingface.co/SBB/eynollah-full-regions-3pluscolumn)

This model is responsible for detecting headers and drop capitals in documents with three or more columns.

### Textline detection

Model card: [Textline Detection](https://huggingface.co/SBB/eynollah-textline)

The method for textline detection combines deep learning and heuristics. In the deep learning part, an image-to-image
model segments the textlines. In the heuristic part, a strap is cut out around each textline, and the text in the strap resulting from
segmentation is first deskewed and then the textlines are separated with the same method to obtain the
textline bounding boxes. Later, the strap is rotated back into its original orientation.

### Textline detection (light)

Model card: [Textline Detection Light (simpler but faster method)](https://huggingface.co/SBB/eynollah-textline_light)

The method for textline detection combines deep learning and heuristics. In the deep learning part, an image-to-image
model segments the textlines. In this light version, the textline labels are dilated during training,
enhancing the model's ability to accurately identify and delineate individual textlines. This approach
eliminates the need for additional heuristics in extracting textline contours.
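
A minimal sketch of this label-dilation idea (illustrative only; the kernel size, iteration count, and function name are assumptions, not eynollah's exact training code):

```python
# Sketch: thicken binary textline masks before training (illustrative assumptions).
import cv2
import numpy as np

def dilate_textline_labels(mask: np.ndarray, kernel_size: int = 3,
                           iterations: int = 2) -> np.ndarray:
    """Dilate a binary textline mask so each line is slightly thickened."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.dilate(mask, kernel, iterations=iterations)
```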

### Table detection

Model card: [Table Detection](https://huggingface.co/SBB/eynollah-tables)

The objective of this model is to perform table segmentation in historical document images. Due to the pixel-wise
segmentation approach, the model can
effectively identify and delineate tables within the historical document images,
enabling subsequent analysis and interpretation.

### Image detection

Model card: [Image Detection](https://huggingface.co/SBB/eynollah-image-extraction)

This model is used for the task of illustration detection only.

### Reading order detection

Model card: [Reading Order Detection]()

TODO

## Heuristic methods

Additionally, some heuristic methods are employed to further improve the model predictions:

* After border detection, the largest contour is determined by a bounding box, and the image is cropped to these coordinates (see the sketch after this list).
* For text region detection, the image is scaled up to make it easier for the model to detect background space between text regions.
* A minimum area is defined for text regions in relation to the overall image dimensions, so that very small regions that are noise can be filtered out.
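
To make the first heuristic concrete, here is a minimal sketch of cropping an image to the largest detected contour with OpenCV; it is an illustration under assumed inputs (a binary page mask from the border-detection model), not eynollah's exact implementation:

```python
# Minimal sketch of the border-cropping heuristic (illustrative, not eynollah's exact code).
import cv2
import numpy as np

def crop_to_page(image: np.ndarray, page_mask: np.ndarray) -> np.ndarray:
    """Crop `image` to the bounding box of the largest contour in `page_mask`."""
    contours, _ = cv2.findContours(page_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return image  # nothing detected, keep the full image
    largest = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(largest)
    return image[y:y + h, x:x + w]
```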

# Training documentation

This aims to assist users in preparing training datasets, training models, and performing inference with trained models.
We cover various use cases including pixel-wise segmentation, image classification, image enhancement, and machine-based
reading order detection. For each use case, we provide guidance on how to generate the corresponding training dataset.

The following three tasks can all be accomplished using the code in the training repository:

* generate training dataset
* train a model
* inference with the trained model

## Generate training dataset

The script `generate_gt_for_training.py` is used for generating training datasets. As the output of the following
command demonstrates, the dataset generator provides three different commands:
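
For example, the available commands can be listed via the script's help output (assuming the standard `--help` flag):

```sh
python generate_gt_for_training.py --help
```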

These three commands are:

* image-enhancement
* machine-based-reading-order
* pagexml2label

### image-enhancement

Generating a training dataset for image enhancement is quite straightforward. All that is needed is a set of
high-resolution images. The training dataset can then be generated using the following command:

```sh
python generate_gt_for_training.py image-enhancement \
  -dis "dir of high resolution images" \
  -dois "dir where degraded images will be written" \
  -dols "dir where the corresponding high resolution image will be written as label" \
  -scs "degrading scales json file"
```

The scales JSON file is a dictionary with a key named `scales` and values representing scales smaller than 1. Images are
downscaled based on these scales and then upscaled again to their original size. This process causes the images to lose
resolution at different scales. The degraded images are used as input images, and the original high-resolution images
serve as labels. The enhancement model can be trained with this generated dataset. The scales JSON file looks like this:

```yaml
# example scales (values illustrative; any scales smaller than 1 work)
{
    "scales": [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9]
}
```
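
The degradation itself amounts to downscaling and upscaling back; a minimal sketch (illustrative, with an assumed function name, not the generator's actual code):

```python
# Sketch: degrade an image by downscaling and then upscaling back (illustrative).
import cv2

def degrade(image, scale: float):
    """Downscale by `scale` (< 1), then upscale back to the original size."""
    h, w = image.shape[:2]
    small = cv2.resize(image, (max(1, int(w * scale)), max(1, int(h * scale))),
                       interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_CUBIC)
```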

### machine-based-reading-order

For machine-based reading order, we aim to determine the reading priority between two sets of text regions. The model's
input is a three-channel image: the first and last channels contain information about each of the two text regions,
while the middle channel encodes prominent layout elements necessary for reading order, such as separators and headers.
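
As a sketch of this input encoding (the exact channel order and function name are illustrative assumptions, not the verified eynollah convention):

```python
# Sketch: assemble the three-channel reading-order input (illustrative assumptions).
import numpy as np

def build_reading_order_input(region_a: np.ndarray, region_b: np.ndarray,
                              layout: np.ndarray) -> np.ndarray:
    """Stack two binary text-region masks around a layout-elements channel.

    region_a, region_b: (H, W) binary masks of the two text regions to compare.
    layout: (H, W) binary mask of separators, headers, and similar elements.
    Returns an (H, W, 3) float array to be fed to the model.
    """
    return np.stack([region_a, layout, region_b], axis=-1).astype(np.float32)
```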

For output images, it is necessary to specify the width and height. Additionally, a minimum text region size can be set
to filter out regions smaller than this minimum size. This minimum size is defined as the ratio of the text region area
to the image area, with a default value of zero. To run the dataset generator, use the following command:

```sh
python generate_gt_for_training.py machine-based-reading-order \
  -dx "dir of GT xml files" \
  -domi "dir where output images will be written" \
  -docl "dir where the labels will be written" \
  -ih "height" \
  -iw "width" \
  -min "min area ratio"
```

### pagexml2label

pagexml2label is designed to generate labels from GT page XML files for various pixel-wise segmentation use cases,
including 'layout,' 'textline,' 'printspace,' 'glyph,' and 'word' segmentation.
To train a pixel-wise segmentation model, we require images along with their corresponding labels. Our training script
expects the labels as PNG files in which each pixel carries its class value.

In the case of the graphic region, "stamp" has its own class, while all other types are classified under one class. Types such as "separator
region" are also present in the label. However, other regions like "noise region" and "table region" will not be
included in the label PNG file, even if they have information in the page XML files, as we chose not to include them.

```sh
python generate_gt_for_training.py pagexml2label \
  -dx "dir of GT xml files" \
  -do "dir where output label png files will be written" \
  -cfg "custom config json file" \
  -to "output type which has 2d and 3d. 2d is used for training and 3d is just to visualise the labels"
```

We have also defined an artificial class that can be added to the boundary of text region types or text lines. This key
is called "artificial_class_on_boundary". If users want to apply this to certain text regions in the layout use case,
the desired text region types can be listed under this key in the custom config file, as in the illustrative example
below.
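
The key names and values in this sketch of a custom config file are assumptions pieced together from the description above, not a verbatim config from the repository:

```yaml
{
    "use_case": "layout",
    "textregions": {"paragraph": 1, "heading": 2, "header": 2, "drop-capital": 3},
    "graphicregions": {"stamp": 4, "rest_as_decoration": 5},
    "imageregion": 6,
    "separatorregion": 7,
    "artificial_class_on_boundary": ["paragraph", "heading"],
    "artificial_class_label": 8
}
```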

If cropping is enabled (the `-ps` flag in the command below), then, since cropping will be applied to the label files, the directory of the original images must also be
provided to ensure that they are cropped in sync with the labels. This ensures that the correct images and labels
required for training are obtained. The command should resemble the following:

```sh
python generate_gt_for_training.py pagexml2label \
  -dx "dir of GT xml files" \
  -do "dir where output label png files will be written" \
  -cfg "custom config json file" \
  -to "output type which has 2d and 3d. 2d is used for training and 3d is just to visualise the labels" \
  -ps \
  -di "dir where the org images are located" \
  -doi "dir where the cropped output images will be written"
```

## Train a model

### classification

For the classification use case, we haven't provided a ground truth generator, as it's unnecessary. For classification,
all that is required is a "dir_train" directory in which the images of each class are placed in a subdirectory named
after that class. The "dir_eval" directory has the same structure as the train directory:
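
The listing would resemble the following (directory and class names are illustrative):

```
dir_train/
├── class_a/
│   ├── 0001.png
│   └── 0002.png
└── class_b/
    ├── 0001.png
    └── 0002.png
```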

The classification model can be trained using the following command line:

```sh
python train.py with config_classification.json
```

As evident in the example JSON file above, for classification we utilize an "f1_threshold_classification" parameter.
This parameter is employed to gather all models with an evaluation f1 score surpassing this threshold. Subsequently,
an ensemble of these models is created and saved.

The reading order model can likewise be trained via the same command line as in the classification case.

### Segmentation (Textline, Binarization, Page extraction and layout) and enhancement

#### Parameter configuration for segmentation or enhancement use cases

The following parameter configuration can be applied to all segmentation use cases and to enhancement. The augmentation,
its sub-parameters, and continued training are defined only for segmentation use cases and enhancement, not for
classification and machine-based reading order, as you can see in their example config files.

Once the config file is prepared, the model can be trained using the following
command, similar to the process for classification and reading order:

```sh
python train.py with config_classification.json
```

#### Binarization

An example config json file for binarization can be like this (abridged; parameter values are illustrative):

```yaml
{
    "task": "binarization",
    "n_classes": 2,
    "n_epochs": 4,
    "input_height": 224,
    "input_width": 672,
    "patches": true,
    "augmentation": false,
    "dir_train": "./train",
    "dir_eval": "./eval",
    "dir_output": "./output"
}
```

For page segmentation (or printspace or border segmentation), the model needs to view the whole image at once,
hence the patches parameter should be set to false.

#### Layout segmentation

An example config json file for layout segmentation with 5 classes (including background) can be like this (abridged; parameter values are illustrative):

```yaml
{
    "task": "segmentation",
    "n_classes": 5,
    "n_epochs": 4,
    "input_height": 448,
    "input_width": 448,
    "patches": true,
    "augmentation": true,
    "dir_train": "./train",
    "dir_eval": "./eval",
    "dir_output": "./output"
}
```

## Inference with the trained model

### classification

For conducting inference with a trained model, you simply need to execute the following command line, specifying the
directory of the model and the image on which to perform inference:

```sh
python inference.py -m "model dir" -i "image"
```

This will straightforwardly return the class of the image.

### machine-based reading order

To infer the reading order using a reading order model, we need a page XML file containing layout information but
without the reading order. We simply need to provide the model directory, the XML file, and the output directory.
The new XML file with the added reading order will be written to the output directory with the same name.
We need to run:

```sh
python inference.py \
  -m "model dir" \
  -xml "page xml file" \
  -o "output dir to write new xml with reading order"
```

### Segmentation (Textline, Binarization, Page extraction and layout) and enhancement

For conducting inference with a trained model for segmentation and enhancement, you need to run the following command
line:

```sh
python inference.py \
  -m "model dir" \
  -i "image" \
  -p \
  -s "output image"
```

Note that in the case of page extraction, the `-p` flag is not needed.