# Models documentation
This suite of 14 models presents a document layout analysis (DLA) system for historical documents implemented by
pixel-wise segmentation using a combination of a ResNet50 encoder with various U-Net decoders. In addition, heuristic
methods are applied to detect marginals and to determine the reading order of text regions.

Detecting and classifying multiple classes of layout elements, such as headings, images and tables, is a prerequisite
for extracting and processing them in subsequent steps. The wide variety of layouts found in over 400 years of printed
cultural heritage makes this combination of detection, classification and segmentation very challenging. The deep
learning models are therefore complemented with heuristics for the detection of text lines, marginals, and reading
order. An optional image enhancement step handles documents that have insufficient pixel density or require scaling,
and a column classifier supports the analysis of multi-column documents. With these additions, DLA performance improved
and the reading order is predicted with high accuracy.

Two Arabic/Persian terms form the name of the model suite: عين الله, which can be transcribed as "ain'allah" or
"eynollah"; it translates into English as "God's Eye" -- it sees (nearly) everything on the document image.

See the flowchart below for the different stages and how they interact:
![](https://user-images.githubusercontent.com/952378/100619946-1936f680-331e-11eb-9297-6e8b4cab3c16.png)

## Models

### Image enhancement
Model card: [Image Enhancement](https://huggingface.co/SBB/eynollah-enhancement)

This model targets documents whose image resolution is too low. When layout detection performs poorly on such
documents, the enhancement step aims to improve the quality and clarity of the image so that the subsequent analysis
becomes easier.

### Page extraction / border detection
Model card: [Page Extraction/Border Detection](https://huggingface.co/SBB/eynollah-page-extraction)

Black margins around a page, caused by document scanning, can negatively affect OCR. A deep learning model helps to
crop the image to the page borders by using a pixel-wise segmentation method.
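
The cropping step can be illustrated with a minimal sketch. It assumes the segmentation model has already produced a
binary page mask containing a single page region; the function name `crop_to_page` and the toy data are hypothetical
and not part of Eynollah.

```python
import numpy as np

def crop_to_page(image: np.ndarray, page_mask: np.ndarray) -> np.ndarray:
    """Crop `image` to the bounding box of the non-zero pixels in `page_mask`."""
    rows = np.flatnonzero(page_mask.any(axis=1))  # rows that contain page pixels
    cols = np.flatnonzero(page_mask.any(axis=0))  # columns that contain page pixels
    if rows.size == 0:
        return image  # no page pixels predicted: leave the image unchanged
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

# Toy data (assumption): a 6x8 "scan" whose page occupies rows 1-4 and columns 2-6.
scan = np.arange(48).reshape(6, 8)
mask = np.zeros((6, 8), dtype=bool)
mask[1:5, 2:7] = True
cropped = crop_to_page(scan, mask)
```

If a mask contains several disconnected regions, the largest one would have to be selected first; that step is omitted
here.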

### Column classification
Model card: [Column Classification](https://huggingface.co/SBB/eynollah-column-classifier)

This model is a classifier that predicts the number of columns in a document. It was trained on a set of documents
manually classified into six classes: one, two, three, four, five, or six and more columns.
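
As a rough illustration of how a six-class output could be read, the sketch below maps a score vector to a column
label. The class order, the function name `columns_from_scores` and the example scores are assumptions made for this
example only.

```python
import numpy as np

# Assumed class order: index 0 means one column, index 5 means six or more columns.
COLUMN_LABELS = ["1", "2", "3", "4", "5", "6 or more"]

def columns_from_scores(scores) -> str:
    """Return the column label with the highest score."""
    scores = np.asarray(scores)
    if scores.shape != (len(COLUMN_LABELS),):
        raise ValueError("expected one score per column class")
    return COLUMN_LABELS[int(np.argmax(scores))]

# Invented scores for illustration, not real model output.
example_scores = [0.05, 0.70, 0.15, 0.05, 0.03, 0.02]
label = columns_from_scores(example_scores)
```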

### Binarization
Model card: [Binarization](https://huggingface.co/SBB/eynollah-binarization)

This model performs document image binarization, i.e. the segmentation of the image into black and white pixels.
Binarization contributes significantly to the performance of the layout models, particularly when documents are
degraded or of poor quality, and makes the subsequent layout analysis more accurate and reliable.

### Main region detection
Model card: [Main Region Detection](https://huggingface.co/SBB/eynollah-main-regions)

This model uses a different set of labels, including an artificial class designed to encompass the text regions. This
class makes it easier for the model to isolate text regions. It also allows the model to be trained on downscaled
images, which leads to faster predictions at inference time without compromising the ability to identify and classify
text regions accurately.

### Main region detection (with scaling augmentation)
Model card: [Main Region Detection (with scaling augmentation)](https://huggingface.co/SBB/eynollah-main-regions-aug-scaling)

Trained with scaling augmentation, this model can segment elements that appear at very large or very small scales
within documents. This improves the categorization and isolation of such elements and allows more precise analysis of
documents with varying scale characteristics.
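
A minimal sketch of scaling augmentation for segmentation training data follows. It rescales an image and its label
map together using nearest-neighbour sampling so that class ids stay intact. The helper names and the scale factors
are assumptions; the actual training pipeline may differ.

```python
import numpy as np

def rescale_nearest(array: np.ndarray, factor: float) -> np.ndarray:
    """Rescale the first two axes of `array` by `factor` using nearest-neighbour sampling."""
    height, width = array.shape[:2]
    new_h = max(1, int(round(height * factor)))
    new_w = max(1, int(round(width * factor)))
    # Map each target row/column back to its source index, clamped to the valid range.
    row_idx = np.minimum((np.arange(new_h) / factor).astype(int), height - 1)
    col_idx = np.minimum((np.arange(new_w) / factor).astype(int), width - 1)
    return array[row_idx][:, col_idx]

def scaled_pairs(image, labels, factors=(0.5, 1.0, 2.0)):
    """Return (image, labels) pairs rescaled by each factor (factors are illustrative)."""
    return [(rescale_nearest(image, f), rescale_nearest(labels, f)) for f in factors]

# Toy data (assumption): a 4x6 image with a two-class label map.
image = np.arange(24).reshape(4, 6)
labels = (image % 2).astype(np.uint8)
pairs = scaled_pairs(image, labels)
```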

### Main region detection (with rotation augmentation)
Model card: [Main Region Detection (with rotation augmentation)](https://huggingface.co/SBB/eynollah-main-regions-aug-rotation)

This model is trained with rotation augmentation, which helps it segment vertical text regions robustly.
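
Rotation augmentation can be sketched in the same spirit. The example below rotates an image and its label map
together by multiples of 90 degrees, so horizontal text in the training data also appears as vertical text. Restricting
the angles to quarter turns is an assumption made for simplicity, and `rotated_pairs` is a hypothetical helper.

```python
import numpy as np

def rotated_pairs(image, labels, quarter_turns=(1, 2, 3)):
    """Return (image, labels) pairs rotated counterclockwise by k * 90 degrees."""
    return [(np.rot90(image, k), np.rot90(labels, k)) for k in quarter_turns]

# Toy data (assumption): a 2x5 label map with one horizontal "textline" in the first row.
labels = np.zeros((2, 5), dtype=np.uint8)
labels[0, :] = 1
image = labels * 255
rotated = rotated_pairs(image, labels)
```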

### Main region detection (ensembled)
Model card: [Main Region Detection (ensembled)](https://huggingface.co/SBB/eynollah-main-regions-ensembled)

This model is made more robust by an ensembling technique that combines the weights from various training epochs.
Drawing on the strengths of multiple epochs yields more consistent and reliable results.
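
One common way to combine weights from several epochs is to average them layer by layer. The sketch below shows this
under the assumption that every checkpoint is a list of arrays with identical shapes; whether Eynollah uses a plain
average or a different combination rule is not stated here, and `average_checkpoints` is a hypothetical name.

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Average a list of checkpoints, each given as a list of per-layer weight arrays."""
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    # zip(*checkpoints) groups the versions of the same layer across checkpoints.
    return [np.mean(np.stack(versions), axis=0) for versions in zip(*checkpoints)]

# Toy data (assumption): two checkpoints with two "layers" each.
epoch_a = [np.ones((2, 2)), np.zeros(3)]
epoch_b = [3 * np.ones((2, 2)), 2 * np.ones(3)]
ensembled = average_checkpoints([epoch_a, epoch_b])
```

With a Keras model, lists of this form correspond to what `get_weights()` returns and `set_weights()` accepts.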

### Full region detection (1,2-column documents)
Model card: [Full Region Detection (1,2-column documents)](https://huggingface.co/SBB/eynollah-full-regions-1column)

This model handles documents with one or two columns.

### Full region detection (3,n-column documents)
Model card: [Full Region Detection (3,n-column documents)](https://huggingface.co/SBB/eynollah-full-regions-3pluscolumn)

This model is responsible for detecting headers and drop capitals in documents with three or more columns.

### Textline detection
Model card: [Textline Detection](https://huggingface.co/SBB/eynollah-textline)

The method for textline detection combines deep learning and heuristics. In the deep learning part, an image-to-image
model performs binary segmentation of the document into the classes textline vs. background. In the heuristics part,
bounding boxes or contours are derived from binary segmentation.

Skewed documents can heavily affect textline detection accuracy, so robust deskewing is needed. However, rectangular
bounding boxes cannot capture partially curved textlines. To address this, a dedicated function for documents with
curved textlines was included. After finding the contour of a text region and its corresponding textline segmentation,
the text region is cut into smaller vertical straps. For each strap, the textline segmentation is first deskewed and
the textlines are then separated with the same heuristic method as for finding textline bounding boxes. Finally, the
strap is rotated back into its original orientation.
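
The strap idea can be sketched as follows. The example cuts a binary textline mask into vertical straps and finds the
row ranges that contain text in each strap, so a curved line yields a slightly different box per strap. The per-strap
deskewing and back-rotation are deliberately omitted, and all names are hypothetical.

```python
import numpy as np

def runs(flags):
    """Return (start, stop) pairs for each run of True values in a 1-D boolean array."""
    padded = np.concatenate(([False], flags, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(changes[::2].tolist(), changes[1::2].tolist()))

def textlines_per_strap(line_mask: np.ndarray, n_straps: int):
    """Return (x0, y0, x1, y1) boxes for the text rows found in each vertical strap."""
    boxes = []
    offsets = np.linspace(0, line_mask.shape[1], n_straps + 1).astype(int)
    for x0, x1 in zip(offsets[:-1], offsets[1:]):
        rows_with_text = line_mask[:, x0:x1].any(axis=1)
        for y0, y1 in runs(rows_with_text):
            boxes.append((int(x0), y0, int(x1), y1))
    return boxes

# Toy data (assumption): one "curved" line that sits one row lower in the right half.
mask = np.zeros((6, 8), dtype=bool)
mask[1:3, 0:4] = True
mask[2:4, 4:8] = True
strap_boxes = textlines_per_strap(mask, n_straps=2)
```

A single rectangle around this line would span rows 1 to 3 across the full width, whereas the two strap boxes follow
the curve more closely.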

### Textline detection (light)
Model card: [Textline Detection Light (simpler but faster method)](https://huggingface.co/SBB/eynollah-textline_light)

The method for textline detection combines deep learning and heuristics. In the deep learning part, an image-to-image
model performs binary segmentation of the document into the classes textline vs. background. In the heuristics part,
bounding boxes or contours are derived from binary segmentation.

This textline model uses a distinct labeling approach: an artificial bounding class is added alongside the textline
classes. The bounding class prevents spurious connections between adjacent textlines during prediction, so the model
can delineate individual textlines more accurately. As a result, no additional heuristics are needed to extract
textline contours.

### Table detection
Model card: [Table Detection](https://huggingface.co/SBB/eynollah-tables)

This model performs table segmentation in historical document images. Because segmentation is pixel-wise and
traditional tables consist predominantly of text, heuristics had to be added to achieve reasonable table detection
performance.

### Image detection
Model card: [Image Detection](https://huggingface.co/SBB/eynollah-image-extraction)

This model is used for the task of illustration detection only.

### Reading order detection
Model card: [Reading Order Detection]()

TODO

## Heuristic methods
Additionally, some heuristic methods are employed to further improve the model predictions:
* After border detection, the bounding box of the largest contour is determined, and the image is cropped to these coordinates.
* For text region detection, the image is scaled up to make it easier for the model to detect background space between text regions.
* A minimum area is defined for text regions in relation to the overall image dimensions, so that very small regions that are noise can be filtered out.
* Deskewing is applied on the text region level (due to regions having different degrees of skew) in order to improve the textline segmentation result.
* After deskewing, a calculation of the pixel distribution on the X-axis allows the separation of textlines (foreground) and background pixels.
* Finally, using the derived coordinates, bounding boxes are determined for each textline.
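
Two of these heuristics, the minimum-area filter and the profile-based textline separation, can be sketched with NumPy.
This is a simplified illustration rather than Eynollah's implementation: the function names, the `min_fraction`
threshold and the toy data are assumptions, and the region is assumed to be deskewed already.

```python
import numpy as np

def keep_large_regions(boxes, image_shape, min_fraction=0.001):
    """Drop (x0, y0, x1, y1) boxes covering less than `min_fraction` of the image area."""
    image_area = image_shape[0] * image_shape[1]
    return [b for b in boxes
            if (b[2] - b[0]) * (b[3] - b[1]) >= min_fraction * image_area]

def textline_boxes(region_mask: np.ndarray):
    """Split a deskewed binary region into textline boxes using a row profile."""
    profile = region_mask.sum(axis=1)  # foreground pixels per row, summed along x
    is_text = profile > 0
    boxes, start = [], None
    # The trailing False closes a textline that touches the bottom edge.
    for y, flag in enumerate(list(is_text) + [False]):
        if flag and start is None:
            start = y  # first row of a new textline
        elif not flag and start is not None:
            cols = np.flatnonzero(region_mask[start:y].any(axis=0))
            boxes.append((int(cols[0]), start, int(cols[-1]) + 1, y))
            start = None
    return boxes

# Toy data (assumption): a 7x10 region with two textlines separated by a blank row.
region = np.zeros((7, 10), dtype=bool)
region[1:3, 1:9] = True
region[4:6, 2:7] = True
lines = textline_boxes(region)
kept = keep_large_regions([(0, 0, 100, 50), (0, 0, 2, 2)], (1000, 1000))
```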