Because the distance and editops calculation is a performance bottleneck in
this application, we replaced the custom Levenshtein implementation with
the C implementation from the python-Levenshtein package.
We now also have separate entry points for texts with and without Unicode
normalization, because normalization can be done more efficiently once
during preprocessing.
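A minimal sketch of the two entry points follows. It is an illustration under assumptions, not the project's actual code: the function names `distance` and `distance_normalized` and the choice of NFC are hypothetical, and a small pure-Python Levenshtein function stands in for the C implementation so that the example runs without extra dependencies.

```python
import unicodedata


def distance_normalized(s1: str, s2: str) -> int:
    """Levenshtein distance for inputs the caller has already normalized.

    Pure-Python stand-in for the C implementation (assumption for this sketch).
    """
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # substitution (0 if equal)
            ))
        previous = current
    return previous[-1]


def distance(s1: str, s2: str) -> int:
    """Entry point that normalizes both inputs first (NFC is an assumption)."""
    s1 = unicodedata.normalize("NFC", s1)
    s2 = unicodedata.normalize("NFC", s2)
    return distance_normalized(s1, s2)
```

A caller that compares the same text many times can normalize once during preprocessing and then call `distance_normalized` directly, which avoids repeating the normalization on every call.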
Dinglehopper did not consider `OrderedGroupIndexed` in the `ReadingOrder`
element when extracting text regions. As a consequence, a `TableRegion`
was not considered for text extraction.
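The idea of following nested groups in the reading order can be sketched as below. This is a simplified illustration, not Dinglehopper's implementation: the XML sample is made up, it omits the PAGE namespace, and the helper name `region_refs_in_order` is hypothetical. The nested group element is spelled `OrderedGroupIndexed`, as in the PAGE schema.

```python
import xml.etree.ElementTree as ET

# Made-up, namespace-free sample of a PAGE-like reading order.
PAGE_XML = """
<Page>
  <ReadingOrder>
    <OrderedGroup id="ro">
      <RegionRefIndexed index="0" regionRef="r1"/>
      <OrderedGroupIndexed index="1" id="table-group">
        <RegionRefIndexed index="1" regionRef="cell2"/>
        <RegionRefIndexed index="0" regionRef="cell1"/>
      </OrderedGroupIndexed>
      <RegionRefIndexed index="2" regionRef="r2"/>
    </OrderedGroup>
  </ReadingOrder>
</Page>
""".strip()


def region_refs_in_order(group):
    """Collect region references in index order, descending into nested groups."""
    refs = []
    for child in sorted(group, key=lambda e: int(e.get("index"))):
        if child.tag == "RegionRefIndexed":
            refs.append(child.get("regionRef"))
        elif child.tag == "OrderedGroupIndexed":
            # Without this branch the regions inside the nested group are skipped.
            refs.extend(region_refs_in_order(child))
    return refs


root = ET.fromstring(PAGE_XML)
order = region_refs_in_order(root.find("./ReadingOrder/OrderedGroup"))
```

In this sample, dropping the `OrderedGroupIndexed` branch would return only `r1` and `r2`, which mirrors how regions in a nested group can be missed.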
Sharing even parts of the .idea folder in a worldwide setting is bound to
generate more problems than solutions. Therefore it should be removed
and consequently ignored in .gitignore.
This also adds some Python-specific entries to the .gitignore file.
When extracting text from TextEquiv nodes we may encounter nodes without
an index or nodes that should be sorted via the conf attribute.
Therefore we added a more complex algorithm to select a TextEquiv and
inform the user via log messages if we encounter structures that we can
handle but that may produce unexpected results.
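One possible selection strategy is sketched below. The exact rules are assumptions made for illustration and are not taken from the actual implementation: prefer the lowest index, fall back to the highest conf when no node has an index, and log a warning in the ambiguous cases. The `TextEquiv` dataclass and `pick_text_equiv` are hypothetical names.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional

log = logging.getLogger(__name__)


@dataclass
class TextEquiv:
    """Hypothetical stand-in for a parsed TextEquiv node."""
    text: str
    index: Optional[int] = None
    conf: Optional[float] = None


def pick_text_equiv(candidates: List[TextEquiv]) -> Optional[TextEquiv]:
    """Pick one TextEquiv, warning when the input is ambiguous."""
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]

    indexed = [c for c in candidates if c.index is not None]
    if indexed:
        if len(indexed) < len(candidates):
            log.warning("Some TextEquiv nodes have no index; ignoring them.")
        # Assumed rule: the lowest index wins.
        return min(indexed, key=lambda c: c.index)

    log.warning("No TextEquiv node has an index; falling back to conf.")
    # Assumed rule: the highest conf wins; a missing conf sorts last.
    return max(
        candidates,
        key=lambda c: c.conf if c.conf is not None else float("-inf"),
    )
```

The warnings make the fallback visible, so a user can tell when the result depends on a heuristic rather than on explicit indices.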
Python's `sorted` function fails with a TypeError when called on a list
that mixes `None` and integers:
```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we are using `float('inf')` instead of `None` in case of
missing textline indices.
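A small sketch of this workaround, with made-up data and a hypothetical `sort_key` helper: entries without an index get `float('inf')` as their sort key and therefore end up after all indexed entries.

```python
# Made-up (text, index) pairs; None marks a missing textline index.
lines = [("second", 1), ("no index", None), ("first", 0)]


def sort_key(line):
    """Use infinity for a missing index so the entry sorts last."""
    index = line[1]
    return float("inf") if index is None else index


ordered = [text for text, _ in sorted(lines, key=sort_key)]
```

Integers and floats compare without error in Python, so the key function yields a well-defined order.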
Add a proposal for an .editorconfig file (see https://editorconfig.org/).
This format is natively supported by many editors; others support it via
plugins.
This will close #19.