improve docs/train.md

Robert Sachunsky 2026-01-22 11:25:50 +01:00
parent 87d7ffbdd8
commit 6a81db934e


@@ -9,9 +9,9 @@ on how to generate the corresponding training dataset.
The following three tasks can all be accomplished using the code in the
[`train`](https://github.com/qurator-spk/eynollah/tree/main/train) directory:
* generate training dataset
* train a model
* inference with the trained model
* [Generate training dataset](#generate-training-dataset)
* [Train a model](#train-a-model)
* [Inference with the trained model](#inference-with-the-trained-model)
## Training, evaluation and output
@@ -63,7 +63,7 @@ serve as labels. The enhancement model can be trained with this generated datase
For machine-based reading order, we aim to determine the reading priority between two sets of text regions. The model's
input is a three-channel image: the first and last channels contain information about each of the two text regions,
while the middle channel encodes prominent layout elements necessary for reading order, such as separators and headers.
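The generator script takes care of constructing these samples, but conceptually each input could be assembled as in the
following sketch (NumPy only; array names, sizes and the label convention are illustrative assumptions, not the actual
implementation):

```python
# Conceptual sketch (not the actual generator code): assembling one training
# sample for the reading-order model from binary masks.
import numpy as np

h, w = 448, 448                        # assumed sample height and width
region_a = np.zeros((h, w), np.uint8)  # mask of the first text region
layout   = np.zeros((h, w), np.uint8)  # prominent layout elements (separators, headers, ...)
region_b = np.zeros((h, w), np.uint8)  # mask of the second text region

x = np.stack([region_a, layout, region_b], axis=-1)  # (h, w, 3) model input
y = 1  # assumed convention: 1 if region_a should be read before region_b, else 0
```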
To generate the training dataset, our script requires a page XML file that specifies the image layout with the correct
To generate the training dataset, our script requires a PAGE XML file that specifies the image layout with the correct
reading order.
For output images, it is necessary to specify the width and height. Additionally, a minimum text region size can be set
@@ -82,8 +82,14 @@ eynollah-training generate-gt machine-based-reading-order \
### pagexml2label
pagexml2label is designed to generate labels from GT page XML files for various pixel-wise segmentation use cases,
including 'layout,' 'textline,' 'printspace,' 'glyph,' and 'word' segmentation.
`pagexml2label` is designed to generate labels from PAGE XML GT files for various pixel-wise segmentation use cases,
including:
- `printspace` (i.e. page frame),
- `layout` (i.e. regions),
- `textline`,
- `word`, and
- `glyph`.
To train a pixel-wise segmentation model, we require images along with their corresponding labels. Our training script
expects a PNG image where each pixel corresponds to a label, represented by an integer. The background is always labeled
as zero, while other elements are assigned different integers. For instance, if we have ground truth data with four
@@ -93,7 +99,7 @@ In binary segmentation scenarios such as textline or page extraction, the backgr
element is automatically encoded as 1 in the PNG label.
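A quick way to sanity-check such a label image is to load it and list the pixel values it contains, for example (a
small sketch assuming Pillow and NumPy; the file name is a placeholder):

```python
# Minimal sanity check of a generated 2D label PNG (sketch, not part of the toolkit)
import numpy as np
from PIL import Image

label = np.array(Image.open("sample_label.png"))  # placeholder file name
print(label.shape, label.dtype)
print("classes present:", np.unique(label))       # e.g. [0 1] for a textline label
```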
To specify the desired use case and the elements to be extracted in the PNG labels, a custom JSON file can be passed.
For example, in the case of 'textline' detection, the JSON file would resemble this:
For example, in the case of textline detection, the JSON file could look like this:
```yaml
{
@@ -101,61 +107,77 @@ For example, in the case of 'textline' detection, the JSON file would resemble t
}
```
In the case of layout segmentation a custom config json file can look like this:
In the case of layout segmentation, the config JSON file might look like this:
```yaml
{
"use_case": "layout",
"textregions":{"rest_as_paragraph":1 , "drop-capital": 1, "header":2, "heading":2, "marginalia":3},
"imageregion":4,
"separatorregion":5,
"graphicregions" :{"rest_as_decoration":6 ,"stamp":7}
"textregions": {"rest_as_paragraph": 1, "drop-capital": 1, "header": 2, "heading": 2, "marginalia": 3},
"imageregion": 4,
"separatorregion": 5,
"graphicregions": {"rest_as_decoration": 6, "stamp": 7}
}
```
A possible custom config json file for layout segmentation where the "printspace" is a class:
The same example, but with `PrintSpace` (or `Border`) represented as a separate class:
```yaml
{
"use_case": "layout",
"textregions":{"rest_as_paragraph":1 , "drop-capital": 1, "header":2, "heading":2, "marginalia":3},
"imageregion":4,
"separatorregion":5,
"graphicregions" :{"rest_as_decoration":6 ,"stamp":7}
"printspace_as_class_in_layout" : 8
"textregions": {"rest_as_paragraph": 1, "drop-capital": 1, "header": 2, "heading": 2, "marginalia": 3},
"imageregion": 4,
"separatorregion": 5,
"graphicregions": {"rest_as_decoration": 6, "stamp": 7}
"printspace_as_class_in_layout": 8
}
```
For the layout use case, it is beneficial to first understand the structure of the page XML file and its elements.
In a given image, the annotations of elements are recorded in a page XML file, including their contours and classes.
For an image document, the known regions are 'textregion', 'separatorregion', 'imageregion', 'graphicregion',
'noiseregion', and 'tableregion'.
In the `layout` use-case, it is beneficial to first understand the structure of the PAGE XML file and its elements.
For a given page image, the visible segments are annotated in XML with their polygon coordinates and types.
On the region level, available segment types include `TextRegion`, `SeparatorRegion`, `ImageRegion`, `GraphicRegion`,
`NoiseRegion` and `TableRegion`.
Text regions and graphic regions also have their own specific types. The known types for text regions are 'paragraph',
'header', 'heading', 'marginalia', 'drop-capital', 'footnote', 'footnote-continued', 'signature-mark', 'page-number',
and 'catch-word'. The known types for graphic regions are 'handwritten-annotation', 'decoration', 'stamp', and
'signature'.
Since we don't know all types of text and graphic regions, unknown cases can arise. To handle these, we have defined
two additional types, "rest_as_paragraph" and "rest_as_decoration", to ensure that no unknown types are missed.
This way, users can extract all known types from the labels and be confident that no unknown types are overlooked.
Moreover, text regions and graphic regions in particular are subdivided via `@type`:
- The allowed subtypes for text regions are `paragraph`, `heading`, `marginalia`, `drop-capital`, `header`, `footnote`,
`footnote-continued`, `signature-mark`, `page-number` and `catch-word`.
- The known subtypes for graphic regions are `handwritten-annotation`, `decoration`, `stamp` and `signature`.
In the custom JSON file shown above, "header" and "heading" are extracted as the same class, while "marginalia" is shown
as a different class. All other text region types, including "drop-capital," are grouped into the same class. For the
graphic region, "stamp" has its own class, while all other types are classified together. "Image region" and "separator
region" are also present in the label. However, other regions like "noise region" and "table region" will not be
included in the label PNG file, even if they have information in the page XML files, as we chose not to include them.
These types and subtypes must be mapped to classes for the segmentation model. However, sometimes these fine-grained
distinctions are not useful or the existing annotations are not very usable (too scarce or too unreliable).
In that case, instead of mapping each of these subtypes individually, they can be pooled together by using two special
types:
- `rest_as_paragraph` (covers `paragraph` plus any TextRegion subtypes not mapped otherwise)
- `rest_as_decoration` (covers `decoration` plus any GraphicRegion subtypes not mapped otherwise)
(That way, users can extract all known types from the labels and be confident that no subtypes are overlooked.)
In the custom JSON example shown above, `header` and `heading` are extracted as the same class,
while `marginalia` is modelled as a different class. All other text region types, including `drop-capital`,
are grouped into the same class. For graphic regions, `stamp` has its own class, while all other types
are classified together. `ImageRegion` and `SeparatorRegion` will also be represented with a class label in the
training data. However, other regions like `NoiseRegion` or `TableRegion` will not be included in the PNG files,
even if they were present in the PAGE XML.
The tool expects various command-line options:
```sh
eynollah-training generate-gt pagexml2label \
-dx "dir of GT xml files" \
-do "dir where output label png files will be written" \
-cfg "custom config json file" \
-to "output type which has 2d and 3d. 2d is used for training and 3d is just to visualise the labels"
-dx "dir of input PAGE XML files" \
-do "dir of output label PNG files" \
-cfg "custom config JSON file" \
-to "output type (2d or 3d)"
```
We have also defined an artificial class that can be added to the boundary of text region types or text lines. This key
is called "artificial_class_on_boundary." If users want to apply this to certain text regions in the layout use case,
the example JSON config file should look like this:
As output type, use
- `2d` for training,
- `3d` to just visualise the labels.
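For instance, a concrete invocation (with placeholder directory and file names) might look like this:

```sh
eynollah-training generate-gt pagexml2label \
  -dx ./gt/page \
  -do ./gt/labels \
  -cfg ./configs/layout.json \
  -to 2d
```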
We have also defined an artificial class that can be added to (rendered around) the boundary
of text region types or text lines in order to make separation of neighbouring segments more
reliable. The key is called `artificial_class_on_boundary`, and it takes a list of the text region
types to which it should be applied.
Our example JSON config file could then look like this:
```yaml
{
@@ -177,14 +199,15 @@ the example JSON config file should look like this:
}
```
This implies that the artificial class label, denoted by 7, will be present on PNG files and will only be added to the
elements labeled as "paragraph," "header," "heading," and "marginalia."
This implies that the artificial class label (denoted by 7) will be present in the generated PNG files
and will only be added around segments labeled `paragraph`, `header`, `heading` or `marginalia`. (This
class will be handled specially during decoding at inference and will not show up in the final results.)
For "textline", "word", and "glyph", the artificial class on the boundaries will be activated only if the
"artificial_class_label" key is specified in the config file. Its value should be set as 2 since these elements
represent binary cases. For example, if the background and textline are denoted as 0 and 1 respectively, then the
artificial class should be assigned the value 2. The example JSON config file should look like this for "textline" use
case:
For `printspace`, `textline`, `word`, and `glyph` segmentation use-cases, there is no `artificial_class_on_boundary` key,
but `artificial_class_label` is available. If specified in the config file, its value should be set to 2, because
these elements represent binary classification problems (with the background represented as 0 and the segments as 1).
For example, the JSON config for textline detection could look as follows:
```yaml
{
@@ -193,33 +216,33 @@ case:
}
```
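That is, at a minimum such a config combines the two keys described above (a sketch; the actual example may contain
additional keys):

```yaml
{
    "use_case": "textline",
    "artificial_class_label": 2
}
```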
If the coordinates of "PrintSpace" or "Border" are present in the page XML ground truth files, and the user wishes to
crop only the print space area, this can be achieved by activating the "-ps" argument. However, it should be noted that
in this scenario, since cropping will be applied to the label files, the directory of the original images must be
provided to ensure that they are cropped in sync with the labels. This ensures that the correct images and labels
required for training are obtained. The command should resemble the following:
If the coordinates of `PrintSpace` (or `Border`) are present in the PAGE XML ground truth files,
and one wishes to crop images to only cover the print space bounding box, this can be achieved
by passing the `-ps` option. Note that in this scenario, the directory of the original images
must also be provided, to ensure that the images are cropped in sync with the labels. The command
line would then resemble this:
```sh
eynollah-training generate-gt pagexml2label \
-dx "dir of GT xml files" \
-do "dir where output label png files will be written" \
-cfg "custom config json file" \
-to "output type which has 2d and 3d. 2d is used for training and 3d is just to visualise the labels" \
-dx "dir of input PAGE XML files" \
-do "dir of output label PNG files" \
-cfg "custom config JSON file" \
-to "output type (2d or 3d)" \
-ps \
-di "dir where the org images are located" \
-doi "dir where the cropped output images will be written"
-di "dir of input original images" \
-doi "dir of output cropped images"
```
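Again with placeholder paths, a full invocation including print space cropping could look like this:

```sh
eynollah-training generate-gt pagexml2label \
  -dx ./gt/page \
  -do ./gt/labels \
  -cfg ./configs/layout.json \
  -to 2d \
  -ps \
  -di ./gt/images \
  -doi ./gt/images_cropped
```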
## Train a model
### classification
For the classification use case, we haven't provided a ground truth generator, as it's unnecessary. For classification,
all we require is a training directory with subdirectories, each containing images of its respective classes. We need
For the image classification use-case, we have not provided a ground truth generator, as it is unnecessary.
All we require is a training directory with subdirectories, each containing the images of its respective class. We need
separate directories for training and evaluation, and the class names (subdirectories) must be consistent across both
directories. Additionally, the class names should be specified in the config JSON file, as shown in the following
example. If, for instance, we aim to classify "apple" and "orange," with a total of 2 classes, the
"classification_classes_name" key in the config file should appear as follows:
`classification_classes_name` key in the config file should appear as follows:
```yaml
{
@@ -241,7 +264,7 @@ example. If, for instance, we aim to classify "apple" and "orange," with a total
}
```
The "dir_train" should be like this:
Then `dir_train` should be like this:
```
.
@@ -250,7 +273,7 @@ The "dir_train" should be like this:
└── orange # directory of images for orange class
```
And the "dir_eval" the same structure as train directory:
And `dir_eval` analogously:
```
.
@@ -310,7 +333,7 @@ And the "dir_eval" the same structure as train directory:
└── labels # directory of labels
```
The classification model can be trained like the classification case command line.
The reading-order model can be trained with the same command line as in the classification case.
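For example, assuming the reading-order training config is saved as `config_reading_order.json` (a placeholder name):

```sh
eynollah-training train with config_reading_order.json
```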
### Segmentation (Textline, Binarization, Page extraction and layout) and enhancement
@@ -374,9 +397,9 @@ classification and machine-based reading order, as you can see in their example
* `transformer_num_heads`: Transformer number of heads. Default value is 4.
* `transformer_cnn_first`: We have two types of vision transformers. In one type, a CNN is applied first, followed by a transformer. In the other type, this order is reversed. If transformer_cnn_first is true, it means the CNN will be applied before the transformer. Default value is true.
In the case of segmentation and enhancement the train and evaluation directory should be as following.
In the case of segmentation and enhancement, the training and evaluation data should be organised as follows.
The "dir_train" should be like this:
The "dir_train" directory should be like this:
```
.
@@ -394,11 +417,12 @@ And the "dir_eval" the same structure as train directory:
└── labels # directory of labels
```
After configuring the JSON file for segmentation or enhancement, training can be initiated by running the following
command, similar to the process for classification and reading order:
After configuring the JSON file for segmentation or enhancement,
training can be initiated by running the following command line,
similar to classification and reading-order model training:
```
eynollah-training train with config_classification.json`
```sh
eynollah-training train with config_classification.json
```
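For a segmentation or enhancement model, the config file passed here would be the corresponding JSON prepared above
(the file name below is just a placeholder):

```sh
eynollah-training train with config_segmentation.json
```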
#### Binarization
@@ -690,7 +714,7 @@ This will straightforwardly return the class of the image.
### machine based reading order
To infer the reading order using a reading order model, we need a page XML file containing layout information but
To infer the reading order using a reading order model, we need a PAGE XML file containing layout information but
without the reading order. We simply need to provide the model directory, the XML file, and the output directory. The
new XML file with the added reading order will be written to the output directory with the same name. We need to run: