📝 Document why we are using Unicode text segmentation to produce word results

2025-08-10 17:49:51 +02:00 · 2020-02-03 15:33:11 +01:00 · 2020-02-03 15:33:11 +01:00 · 91cca1e1b8
commit 91cca1e1b8
parent 0a572df0ba
1 changed file with 5 additions and 0 deletions
--- a/ocrd_calamari/recognize.py
+++ b/ocrd_calamari/recognize.py
@@ -100,6 +100,11 @@ class CalamariRecognize(Processor):
                    line.set_TextEquiv([TextEquivType(Unicode=line_text, conf=line_conf)])

                    # Save word results
+                    #
+                    # Calamari OCR does not provide word positions, so we infer them from (a) Unicode text
+                    # segmentation and (b) the glyph positions. This is necessary because the PAGE XML format
+                    # enforces a strict hierarchy of lines > words > glyphs.
+
                    def unwanted(c):
                        return c == " "
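
The idea described in the added comment can be illustrated with a minimal, self-contained sketch. It is not the code from `recognize.py`: the `Glyph` class, its `start`/`end` fields, and the `segment_words` and `infer_word_boxes` helpers are hypothetical names introduced here, and the regular expression is only a simplified stand-in for real Unicode (UAX #29) word segmentation. The sketch also assumes exactly one glyph per character of the line text.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Glyph:
    """Hypothetical glyph record: one character plus its horizontal extent."""
    char: str
    start: int  # left x-coordinate in pixels (assumed field)
    end: int    # right x-coordinate in pixels (assumed field)


def segment_words(text: str) -> List[str]:
    # Simplified stand-in for Unicode text segmentation: split the text into
    # alternating runs of whitespace and non-whitespace characters. A real
    # implementation would apply the UAX #29 word boundary rules instead.
    return re.findall(r"\s+|\S+", text)


def infer_word_boxes(glyphs: List[Glyph]) -> List[Tuple[str, int, int]]:
    """Return (word, start, end) for each word, derived from glyph positions."""
    text = "".join(g.char for g in glyphs)
    boxes = []
    offset = 0
    for segment in segment_words(text):
        # The segments partition the text, so slicing by length maps each
        # segment back to the glyphs it was built from.
        segment_glyphs = glyphs[offset:offset + len(segment)]
        offset += len(segment)
        if segment.isspace():
            continue  # whitespace segments do not become words
        boxes.append((segment, segment_glyphs[0].start, segment_glyphs[-1].end))
    return boxes


# Usage: five glyphs of width 10 for the line text "ab cd".
line = [Glyph(c, i * 10, (i + 1) * 10) for i, c in enumerate("ab cd")]
print(infer_word_boxes(line))  # [('ab', 0, 20), ('cd', 30, 50)]
```

Each word's extent is taken from its first and last glyph, which is how word regions can be filled in for a lines > words > glyphs hierarchy even though the OCR engine itself reports only glyph positions.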