mirror of https://github.com/qurator-spk/eynollah.git

improve docs/train.md

commit 6a81db934e (parent 87d7ffbdd8)

1 changed file with 96 additions and 72 deletions

docs/train.md
````diff
@@ -9,9 +9,9 @@ on how to generate the corresponding training dataset.
 The following three tasks can all be accomplished using the code in the
 [`train`](https://github.com/qurator-spk/eynollah/tree/main/train) directory:
 
-* generate training dataset
-* train a model
-* inference with the trained model
+* [Generate training dataset](#generate-training-dataset)
+* [Train a model](#train-a-model)
+* [Inference with the trained model](#inference-with-the-trained-model)
 
 ## Training, evaluation and output
 
````
````diff
@@ -63,7 +63,7 @@ serve as labels. The enhancement model can be trained with this generated datase
 For machine-based reading order, we aim to determine the reading priority between two sets of text regions. The model's
 input is a three-channel image: the first and last channels contain information about each of the two text regions,
 while the middle channel encodes prominent layout elements necessary for reading order, such as separators and headers.
-To generate the training dataset, our script requires a page XML file that specifies the image layout with the correct
+To generate the training dataset, our script requires a PAGE XML file that specifies the image layout with the correct
 reading order.
 
 For output images, it is necessary to specify the width and height. Additionally, a minimum text region size can be set
````
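The three-channel input described above can be pictured with a short sketch. This is not the project's actual implementation; the function and mask names are invented for illustration, and binary region masks are assumed as input:

```python
import numpy as np
import cv2  # opencv-python

def reading_order_input(region_a, region_b, layout, size=(224, 224)):
    """Stack binary masks into the three-channel input described above:
    channel 0 = first text region, channel 1 = prominent layout elements
    (separators, headers), channel 2 = second text region."""
    channels = [region_a, layout, region_b]
    channels = [cv2.resize(c.astype(np.uint8), size,
                           interpolation=cv2.INTER_NEAREST) for c in channels]
    return np.stack(channels, axis=-1)  # (H, W, 3), ready for the model
```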
````diff
@@ -82,8 +82,14 @@ eynollah-training generate-gt machine-based-reading-order \
 
 ### pagexml2label
 
-pagexml2label is designed to generate labels from GT page XML files for various pixel-wise segmentation use cases,
-including 'layout,' 'textline,' 'printspace,' 'glyph,' and 'word' segmentation.
+`pagexml2label` is designed to generate labels from PAGE XML GT files for various pixel-wise segmentation use cases,
+including:
+- `printspace` (i.e. page frame),
+- `layout` (i.e. regions),
+- `textline`,
+- `word`, and
+- `glyph`.
+
 To train a pixel-wise segmentation model, we require images along with their corresponding labels. Our training script
 expects a PNG image where each pixel corresponds to a label, represented by an integer. The background is always labeled
 as zero, while other elements are assigned different integers. For instance, if we have ground truth data with four
````
````diff
@@ -93,7 +99,7 @@ In binary segmentation scenarios such as textline or page extraction, the backgr
 element is automatically encoded as 1 in the PNG label.
 
 To specify the desired use case and the elements to be extracted in the PNG labels, a custom JSON file can be passed.
-For example, in the case of 'textline' detection, the JSON file would resemble this:
+For example, in the case of textline detection, the JSON contents could be this:
 
 ```yaml
 {
````
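To make the label encoding above concrete: a generated 2D label PNG stores one class integer per pixel and can be inspected directly. A minimal sketch (the file path is a placeholder):

```python
import numpy as np
from PIL import Image

# Each pixel holds a class integer; 0 is background, and in the binary
# textline case 1 is the textline class.
label = np.array(Image.open("labels/page_0001.png"))  # placeholder path
for cls, cnt in zip(*np.unique(label, return_counts=True)):
    print(f"class {cls}: {cnt} pixels")
```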
````diff
@@ -101,61 +107,77 @@ For example, in the case of 'textline' detection, the JSON file would resemble t
 }
 ```
 
-In the case of layout segmentation a custom config json file can look like this:
+In the case of layout segmentation, the config JSON file might look like this:
 
 ```yaml
 {
 "use_case": "layout",
-"textregions":{"rest_as_paragraph":1 , "drop-capital": 1, "header":2, "heading":2, "marginalia":3},
-"imageregion":4,
-"separatorregion":5,
-"graphicregions" :{"rest_as_decoration":6 ,"stamp":7}
+"textregions": {"rest_as_paragraph": 1, "drop-capital": 1, "header": 2, "heading": 2, "marginalia": 3},
+"imageregion": 4,
+"separatorregion": 5,
+"graphicregions": {"rest_as_decoration": 6, "stamp": 7}
 }
 ```
 
-A possible custom config json file for layout segmentation where the "printspace" is a class:
+The same example if `PrintSpace` (or `Border`) should be represented as a unique class:
 
 ```yaml
 {
 "use_case": "layout",
-"textregions":{"rest_as_paragraph":1 , "drop-capital": 1, "header":2, "heading":2, "marginalia":3},
-"imageregion":4,
-"separatorregion":5,
-"graphicregions" :{"rest_as_decoration":6 ,"stamp":7}
-"printspace_as_class_in_layout" : 8
+"textregions": {"rest_as_paragraph": 1, "drop-capital": 1, "header": 2, "heading": 2, "marginalia": 3},
+"imageregion": 4,
+"separatorregion": 5,
+"graphicregions": {"rest_as_decoration": 6, "stamp": 7},
+"printspace_as_class_in_layout": 8
 }
 ```
 
-For the layout use case, it is beneficial to first understand the structure of the page XML file and its elements.
-In a given image, the annotations of elements are recorded in a page XML file, including their contours and classes.
-For an image document, the known regions are 'textregion', 'separatorregion', 'imageregion', 'graphicregion',
-'noiseregion', and 'tableregion'.
+In the `layout` use-case, it is beneficial to first understand the structure of the PAGE XML file and its elements.
+For a given page image, the visible segments are annotated in XML with their polygon coordinates and types.
+On the region level, available segment types include `TextRegion`, `SeparatorRegion`, `ImageRegion`, `GraphicRegion`,
+`NoiseRegion` and `TableRegion`.
+
-Text regions and graphic regions also have their own specific types. The known types for text regions are 'paragraph',
-'header', 'heading', 'marginalia', 'drop-capital', 'footnote', 'footnote-continued', 'signature-mark', 'page-number',
-and 'catch-word'. The known types for graphic regions are 'handwritten-annotation', 'decoration', 'stamp', and
-'signature'.
-Since we don't know all types of text and graphic regions, unknown cases can arise. To handle these, we have defined
-two additional types, "rest_as_paragraph" and "rest_as_decoration", to ensure that no unknown types are missed.
-This way, users can extract all known types from the labels and be confident that no unknown types are overlooked.
+Moreover, text regions and graphic regions in particular are subdivided via `@type`:
+- The allowed subtypes for text regions are `paragraph`, `heading`, `marginalia`, `drop-capital`, `header`, `footnote`,
+  `footnote-continued`, `signature-mark`, `page-number` and `catch-word`.
+- The known subtypes for graphic regions are `handwritten-annotation`, `decoration`, `stamp` and `signature`.
+
-In the custom JSON file shown above, "header" and "heading" are extracted as the same class, while "marginalia" is shown
-as a different class. All other text region types, including "drop-capital," are grouped into the same class. For the
-graphic region, "stamp" has its own class, while all other types are classified together. "Image region" and "separator
-region" are also present in the label. However, other regions like "noise region" and "table region" will not be
-included in the label PNG file, even if they have information in the page XML files, as we chose not to include them.
+These types and subtypes must be mapped to classes for the segmentation model. However, sometimes these fine-grained
+distinctions are not useful or the existing annotations are not very usable (too scarce or too unreliable).
+In that case, instead of these subtypes with a specific mapping, they can be pooled together by using the two special
+types:
+- `rest_as_paragraph` (mapping missing TextRegion subtypes and `paragraph`)
+- `rest_as_decoration` (mapping missing GraphicRegion subtypes and `decoration`)
+
+(That way, users can extract all known types from the labels and be confident that no subtypes are overlooked.)
+
+In the custom JSON example shown above, `header` and `heading` are extracted as the same class,
+while `marginalia` is modelled as a different class. All other text region types, including `drop-capital`,
+are grouped into the same class. For graphic regions, `stamp` has its own class, while all other types
+are classified together. `ImageRegion` and `SeparatorRegion` will also be represented with a class label in the
+training data. However, other regions like `NoiseRegion` or `TableRegion` will not be included in the PNG files,
+even if they were present in the PAGE XML.
 
 The tool expects various command-line options:
 
 ```sh
 eynollah-training generate-gt pagexml2label \
-    -dx "dir of GT xml files" \
-    -do "dir where output label png files will be written" \
-    -cfg "custom config json file" \
-    -to "output type which has 2d and 3d. 2d is used for training and 3d is just to visualise the labels"
+    -dx "dir of input PAGE XML files" \
+    -do "dir of output label PNG files" \
+    -cfg "custom config JSON file" \
+    -to "output type (2d or 3d)"
 ```
 
-We have also defined an artificial class that can be added to the boundary of text region types or text lines. This key
-is called "artificial_class_on_boundary." If users want to apply this to certain text regions in the layout use case,
-the example JSON config file should look like this:
+As output type, use
+- `2d` for training,
+- `3d` to just visualise the labels.
+
+We have also defined an artificial class that can be added to (rendered around) the boundary
+of text region types or text lines in order to make separation of neighbouring segments more
+reliable. The key is called `artificial_class_on_boundary`, and it takes a list of text region
+types to be applied to.
+
+Our example JSON config file could then look like this:
 
 ```yaml
 {
````
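How a `textregions` mapping with `rest_as_paragraph` pooling resolves region subtypes can be sketched in plain Python. The helper below is hypothetical, not the tool's code:

```python
def textregion_class(subtype, mapping):
    """Resolve a TextRegion @type to a class integer: explicit entries win,
    anything unmapped or missing falls back to the pooled paragraph class."""
    if subtype in mapping:
        return mapping[subtype]
    return mapping.get("rest_as_paragraph")

textregions = {"rest_as_paragraph": 1, "drop-capital": 1,
               "header": 2, "heading": 2, "marginalia": 3}
print(textregion_class("heading", textregions))   # 2
print(textregion_class("footnote", textregions))  # 1 (pooled)
```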
````diff
@@ -177,14 +199,15 @@ the example JSON config file should look like this:
 }
 ```
 
-This implies that the artificial class label, denoted by 7, will be present on PNG files and will only be added to the
-elements labeled as "paragraph," "header," "heading," and "marginalia."
+This implies that the artificial class label (denoted by 7) will be present in the generated PNG files
+and will only be added around segments labeled `paragraph`, `header`, `heading` or `marginalia`. (This
+class will be handled specially during decoding at inference, and will not show up in the final results.)
 
-For "textline", "word", and "glyph", the artificial class on the boundaries will be activated only if the
-"artificial_class_label" key is specified in the config file. Its value should be set as 2 since these elements
-represent binary cases. For example, if the background and textline are denoted as 0 and 1 respectively, then the
-artificial class should be assigned the value 2. The example JSON config file should look like this for "textline" use
-case:
+For `printspace`, `textline`, `word`, and `glyph` segmentation use-cases, there is no `artificial_class_on_boundary` key,
+but `artificial_class_label` is available. If specified in the config file, its value should be set to 2, because
+these elements represent binary classification problems (with background represented as 0, and segments as 1).
+
+For example, the JSON config for textline detection could look as follows:
 
 ```yaml
 {
````
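For the binary use-cases, `artificial_class_label: 2` therefore means the label has three values: background 0, segment 1, boundary 2. A sketch of one way such a rim could be produced (a boolean textline mask is assumed as input; the dilation stands in for whatever the generator actually does):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def textline_label_with_boundary(textline_mask, width=2):
    """0 = background, 1 = textline, 2 = artificial rim around textlines."""
    label = textline_mask.astype(np.uint8)
    rim = binary_dilation(textline_mask, iterations=width) & ~textline_mask
    label[rim] = 2
    return label
```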
````diff
@@ -193,33 +216,33 @@ case:
 }
 ```
 
-If the coordinates of "PrintSpace" or "Border" are present in the page XML ground truth files, and the user wishes to
-crop only the print space area, this can be achieved by activating the "-ps" argument. However, it should be noted that
-in this scenario, since cropping will be applied to the label files, the directory of the original images must be
-provided to ensure that they are cropped in sync with the labels. This ensures that the correct images and labels
-required for training are obtained. The command should resemble the following:
+If the coordinates of `PrintSpace` (or `Border`) are present in the PAGE XML ground truth files,
+and one wishes to crop images to only cover the print space bounding box, this can be achieved
+by passing the `-ps` option. Note that in this scenario, the directory of the original images
+must also be provided, to ensure that the images are cropped in sync with the labels. The command
+line would then resemble this:
 
 ```sh
 eynollah-training generate-gt pagexml2label \
-    -dx "dir of GT xml files" \
-    -do "dir where output label png files will be written" \
-    -cfg "custom config json file" \
-    -to "output type which has 2d and 3d. 2d is used for training and 3d is just to visualise the labels" \
+    -dx "dir of input PAGE XML files" \
+    -do "dir of output label PNG files" \
+    -cfg "custom config JSON file" \
+    -to "output type (2d or 3d)" \
     -ps \
-    -di "dir where the org images are located" \
-    -doi "dir where the cropped output images will be written"
+    -di "dir of input original images" \
+    -doi "dir of output cropped images"
 ```
 
 ## Train a model
 
 ### classification
 
-For the classification use case, we haven't provided a ground truth generator, as it's unnecessary. For classification,
-all we require is a training directory with subdirectories, each containing images of its respective classes. We need
+For the image classification use-case, we have not provided a ground truth generator, as it is unnecessary.
+All we require is a training directory with subdirectories, each containing images of its respective classes. We need
 separate directories for training and evaluation, and the class names (subdirectories) must be consistent across both
 directories. Additionally, the class names should be specified in the config JSON file, as shown in the following
 example. If, for instance, we aim to classify "apple" and "orange," with a total of 2 classes, the
-"classification_classes_name" key in the config file should appear as follows:
+`classification_classes_name` key in the config file should appear as follows:
 
 ```yaml
 {
````
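Conceptually, the `-ps` cropping cuts both image and label to the bounding box of the `Border` (print space) polygon from the PAGE XML. A rough sketch of that idea (the namespace version and paths are assumptions; the actual tool also crops the labels in sync):

```python
from lxml import etree
from PIL import Image

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

def crop_to_border(xml_path, img_path):
    """Crop an image to the bounding box of the Border polygon."""
    root = etree.parse(xml_path).getroot()
    coords = root.find(".//pc:Border/pc:Coords", NS)
    points = [tuple(map(int, p.split(","))) for p in coords.get("points").split()]
    xs, ys = zip(*points)
    return Image.open(img_path).crop((min(xs), min(ys), max(xs), max(ys)))
```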
````diff
@@ -241,7 +264,7 @@ example. If, for instance, we aim to classify "apple" and "orange," with a total
 }
 ```
 
-The "dir_train" should be like this:
+Then `dir_train` should be like this:
 
 ```
 .
````
````diff
@@ -250,7 +273,7 @@ The "dir_train" should be like this:
 └── orange # directory of images for orange class
 ```
 
-And the "dir_eval" the same structure as train directory:
+And `dir_eval` analogously:
 
 ```
 .
````
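The training script reads this layout itself; purely as an illustration of the convention, the same class-per-subdirectory tree is what e.g. Keras can ingest directly (path and image size are placeholders):

```python
import tensorflow as tf

# Labels are inferred from the subdirectory names (e.g. apple/, orange/).
train_ds = tf.keras.utils.image_dataset_from_directory(
    "dir_train", image_size=(224, 224), batch_size=32)
print(train_ds.class_names)  # e.g. ['apple', 'orange']
```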
````diff
@@ -310,7 +333,7 @@ And the "dir_eval" the same structure as train directory:
 └── labels # directory of labels
 ```
 
-The classification model can be trained like the classification case command line.
+The reading-order model can be trained with the same command line as in the classification case.
 
 ### Segmentation (Textline, Binarization, Page extraction and layout) and enhancement
 
````
````diff
@@ -374,9 +397,9 @@ classification and machine-based reading order, as you can see in their example
 * `transformer_num_heads`: Transformer number of heads. Default value is 4.
 * `transformer_cnn_first`: We have two types of vision transformers. In one type, a CNN is applied first, followed by a transformer. In the other type, this order is reversed. If transformer_cnn_first is true, it means the CNN will be applied before the transformer. Default value is true.
 
-In the case of segmentation and enhancement the train and evaluation directory should be as following.
+In the case of segmentation and enhancement, the train and evaluation data should be organised as follows.
 
-The "dir_train" should be like this:
+The `dir_train` directory should be like this:
 
 ```
 .
````
````diff
@@ -394,11 +417,12 @@ And the "dir_eval" the same structure as train directory:
 └── labels # directory of labels
 ```
 
-After configuring the JSON file for segmentation or enhancement, training can be initiated by running the following
-command, similar to the process for classification and reading order:
+After configuring the JSON file for segmentation or enhancement,
+training can be initiated by running the following command line,
+similar to classification and reading-order model training:
 
-```
-eynollah-training train with config_classification.json`
+```sh
+eynollah-training train with config_classification.json
 ```
 
 #### Binarization
````
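Before launching a training run like the one above, a quick check that every image has a matching label of the same name can save a failed run. A minimal sketch (directory names follow the tree above; matching is by file stem):

```python
from pathlib import Path

images = {p.stem for p in Path("dir_train/images").iterdir()}
labels = {p.stem for p in Path("dir_train/labels").iterdir()}
print("images without labels:", sorted(images - labels))
print("labels without images:", sorted(labels - images))
```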
````diff
@@ -690,7 +714,7 @@ This will straightforwardly return the class of the image.
 
 ### machine based reading order
 
-To infer the reading order using a reading order model, we need a page XML file containing layout information but
+To infer the reading order using a reading order model, we need a PAGE XML file containing layout information but
 without the reading order. We simply need to provide the model directory, the XML file, and the output directory. The
 new XML file with the added reading order will be written to the output directory with the same name. We need to run:
 
````
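The added reading order is stored as a `ReadingOrder` element in the output PAGE XML; it can be listed with a few lines (namespace version and path are assumptions):

```python
from lxml import etree

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}
root = etree.parse("output/page_0001.xml").getroot()  # placeholder path
for ref in root.findall(".//pc:ReadingOrder//pc:RegionRefIndexed", NS):
    print(ref.get("index"), ref.get("regionRef"))
```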