📝 Document why we are using Unicode text segmentation to produce word results

2026-06-22 20:09:11 +02:00 · 2020-02-03 15:33:11 +01:00 · 91cca1e1b8
commit 91cca1e1b8
parent 0a572df0ba
1 changed file with 5 additions and 0 deletions
--- a/ocrd_calamari/recognize.py
+++ b/ocrd_calamari/recognize.py
@@ -100,6 +100,11 @@ class CalamariRecognize(Processor):
                    line.set_TextEquiv([TextEquivType(Unicode=line_text, conf=line_conf)])
                    # Save word results
                    #
                    # Calamari OCR does not provide word positions, so we infer them from (a) Unicode text
                    # segmentation and (b) the glyph positions. This is necessary because the PAGE XML format
                    # enforces a strict hierarchy of lines > words > glyphs.
                    def unwanted(c):
                        return c == " "
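
The approach described in the comment above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the `Glyph` class, the `segment` and `infer_words` helpers, and the pixel coordinates are hypothetical, and a simple regular expression stands in for a real Unicode (UAX #29) word segmenter. It assumes one glyph per character of the line text.

```python
import re
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical glyph record: one character plus its horizontal extent."""
    char: str
    start: int  # left edge, in pixels
    end: int    # right edge, in pixels


def unwanted(c):
    return c == " "


def segment(text):
    # Stand-in for Unicode text segmentation: split into runs of spaces and
    # runs of non-spaces. Every character ends up in exactly one segment.
    return re.findall(r" +|[^ ]+", text)


def infer_words(glyphs):
    """Return (word_text, start, end) tuples inferred from glyph positions."""
    text = "".join(g.char for g in glyphs)
    words = []
    i = 0  # index of the first glyph of the current segment
    for segment_text in segment(text):
        n = len(segment_text)
        # Skip segments that consist only of unwanted characters (spaces).
        if not all(unwanted(c) for c in segment_text):
            span = glyphs[i:i + n]
            # A word spans from its first glyph's start to its last glyph's end.
            words.append((segment_text, span[0].start, span[-1].end))
        i += n
    return words
```

Because the segments cover the line text without gaps, advancing the glyph index by each segment's length keeps text and positions aligned. A real Unicode segmenter would also split off punctuation, which this regex stand-in does not.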