From 009ed46d2623c994469cf0b705834276497a4973 Mon Sep 17 00:00:00 2001
From: Clemens Neudecker <clemens.neudecker@gmail.com>
Date: Thu, 14 Nov 2019 22:26:50 +0100
Subject: [PATCH] Update User_Guide.md

---
 docs/User_Guide.md | 61 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 59 insertions(+), 2 deletions(-)

diff --git a/docs/User_Guide.md b/docs/User_Guide.md
index f38964d..e26ae87 100644
--- a/docs/User_Guide.md
+++ b/docs/User_Guide.md
@@ -8,9 +8,66 @@
 #### Technical Requirements 
 [neath](https://github.com/qurator-spk/neath) runs locally as a pure HTML+JavaScript webpage in your web browser. No software needs to be installed, but JavaScript has to be enabled in the browser. Any fairly recent browser should work, but only Chrome and Firefox are tested.
 #### Data input format   
-The input data format is based on the format used in the [GermEval2014 Named Entity Recognition Shared Task](https://sites.google.com/site/germeval2014ner/data). Here, text is encoded as one token per line, with information provided in tab-separated columns. The first column contains either a #, which signals the source the sentence is cited from and the date it was retrieved, or the token number within the sentence. The second column contains the token. Name spans are encoded in the BIO-scheme. Outer spans are encoded in the third column, embedded spans in the fourth column.
+The input data format is based on the format used in the [GermEval2014 Named Entity Recognition Shared Task](https://sites.google.com/site/germeval2014ner/data). Text is encoded as one token per line in tab-separated columns, with name spans encoded in the BIO scheme:
+* the first column contains either a `#`, which signals the source the sentence is cited from, or the token position within the sentence
+* sentence boundaries are indicated by a `0` in the first column
+* the second column contains the token text
+* outer spans are encoded in the third column
+* embedded spans are encoded in the fourth column
+
+Example (simple):
+```tsv
+No.	TOKEN	NE-TAG	NE-EMB
+# https://example.url
+1	Donnerstag	O	O
+2	,	O	O
+3	1	O	O
+4	.	O	O
+5	Januar	O	O
+6	.	O	O
+0		O	O
+1	Berliner	B-ORG	B-LOC
+2	Tageblatt	I-ORG	O
+3	.	O	O
+0		O	O
+1	Nr	O	O
+2	.	O	O
+3	1	O	O
+4	.	O	O
+0		O	O
+1	Seite	O	O
+2	3	O	O
+```
+
+For our purposes, we extend this format by adding:
+* a fifth column holding an identifier for the outer span from an authority file (in this case, the [GND](https://www.dnb.de/EN/Professionell/Standardisierung/GND/gnd_node.html))
+* a sixth column for use as a variable (`url_id` in the example below)
+* columns seven and up for storing the pixel coordinates (`left,right,top,bottom`) of the facsimile snippets
+
+Example (full):
+```tsv
+No.	TOKEN	NE-TAG	NE-EMB	GND-ID	url_id	left,right,top,bottom
+# https://example.url/iiif/left,right,top,bottom/full/0/default.jpg
+1	Donnerstag	O	O	-	0	174,352,358,390
+2	,	O	O	-	0	174,352,358,390
+3	1	O	O	-	0	367,392,361,381
+4	.	O	O	-	0	370,397,352,379
+5	Januar	O	O	-	0	406,518,358,386
+6	.	O	O	-	0	406,518,358,386
+0
+1	Berliner	B-ORG	B-LOC	1086206452	0	816,984,358,388
+2	Tageblatt	I-ORG	O	1086206452	0	1005,1208,360,387
+3	.	O	O	-	0	1005,1208,360,387
+0
+1	Nr	O	O	-	0	1237,1288,360,382
+2	.	O	O	-	0	1237,1288,360,382
+3	1	O	O	-	0	1304,1326,361,381
+4	.	O	O	-	0	1304,1326,361,381
+0
+1	Seite	O	O	-	0	1837,1926,361,392
+2	3	O	O	-	0	1939,1967,364,385
+```
 
-Furthermore, we add a fifth column for an identifier from an authority file (in this case, the [GND](https://www.dnb.de/EN/Professionell/Standardisierung/GND/gnd_node.html) is used). Finally, columns six to nine are used for storing pixel coordinates for the facsimile snippets. 
 #### Data preparation  
 We also provide some [Python tools](https://github.com/qurator-spk/neath/tree/master/tools) that help with data wrangling.
 #### Navigation