mirror of https://github.com/qurator-spk/neat.git (synced 2025-10-30 16:24:12 +01:00)
			
		
		
		
	Update README.md
parent 590fec897a
commit ce317957b1
1 changed file with 4 additions and 3 deletions
				
			
		|  | @ -34,9 +34,10 @@ Clone the repo using ``git clone https://github.com/qurator-spk/neat.git`` or do | ||||||
| The source data we use for annotation are OCR results in [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) format. We provide a [Python tool](https://github.com/qurator-spk/page2tsv) for the transformation of OCR files in [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) into the [TSV format](https://github.com/qurator-spk/neat/blob/master/README.md#22-data-format) used by [neat](https://github.com/qurator-spk/neat). | The source data we use for annotation are OCR results in [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) format. We provide a [Python tool](https://github.com/qurator-spk/page2tsv) for the transformation of OCR files in [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) into the [TSV format](https://github.com/qurator-spk/neat/blob/master/README.md#22-data-format) used by [neat](https://github.com/qurator-spk/neat). | ||||||
|  |  | ||||||
| The internal data format used by [neat](https://github.com/qurator-spk/neat) is based on the format used in the [GermEval2014 ](https://sites.google.com/site/germeval2014ner/data) Named Entity Recognition Shared Task. Text is encoded as one token per line, with name spans in the [IOB2](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) format as tab-separated values: | The internal data format used by [neat](https://github.com/qurator-spk/neat) is based on the format used in the [GermEval2014 ](https://sites.google.com/site/germeval2014ner/data) Named Entity Recognition Shared Task. Text is encoded as one token per line, with name spans in the [IOB2](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) format as tab-separated values: | ||||||
| * the first column contains either a `#`, which signals the source the sentence is cited from, or  | * the first column contains either  | ||||||
| * the token position within the sentence ``>=1`` |   * `#` a comment to indicate the source the sentence is taken from, or  | ||||||
| * sentence boundaries are indicated by ``0`` |   * ``>=1`` the token position within the sentence, or  | ||||||
|  |   * ``0`` to mark sentence boundaries  | ||||||
| * the second column contains the token ``text``  | * the second column contains the token ``text``  | ||||||
| * outer entity spans are encoded in the third column ``NE-TAG`` | * outer entity spans are encoded in the third column ``NE-TAG`` | ||||||
| * embedded entity spans are encoded in the fourth column ``NE-EMB``  | * embedded entity spans are encoded in the fourth column ``NE-EMB``  | ||||||
|  |  | ||||||
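
The column layout described in the new version of the README can be read with a few lines of Python. The following is a minimal sketch, not the parser used by neat itself. It assumes that a row whose first column starts with `#` carries the source of the sentence, that a row whose first column is `0` closes the current sentence, that missing tag columns default to `O`, and that any other row with a non-numeric first column (for example a header row) can be skipped. The names `Token`, `Sentence`, `parse_tsv`, and the sample data are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class Token:
    position: int   # first column, >= 1
    text: str       # second column
    ne_tag: str     # third column (NE-TAG), outer entity span in IOB2
    ne_emb: str     # fourth column (NE-EMB), embedded entity span in IOB2


@dataclass
class Sentence:
    source: Optional[str] = None
    tokens: List[Token] = field(default_factory=list)


def parse_tsv(lines: Iterable[str]) -> List[Sentence]:
    """Group tab-separated token rows into sentences (illustrative only)."""
    sentences: List[Sentence] = []
    current = Sentence()
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore completely empty lines
        cols = line.split("\t")
        first = cols[0].strip()
        if first.startswith("#"):
            # Assumption: everything after '#' is the source of the sentence.
            current.source = " ".join([first[1:]] + cols[1:]).strip()
            continue
        if first == "0":
            # Assumption: a row with position 0 marks a sentence boundary.
            if current.tokens:
                sentences.append(current)
                current = Sentence()
            continue
        if not first.isdigit():
            continue  # assumption: skip header rows or other non-token rows
        cols += ["O"] * (4 - len(cols))  # pad missing tag columns with 'O'
        current.tokens.append(Token(int(first), cols[1], cols[2], cols[3]))
    if current.tokens:
        sentences.append(current)  # flush the last sentence
    return sentences


# Hypothetical sample data, not taken from the neat repository.
SAMPLE = [
    "#\thttp://example.org/doc1",
    "1\tAngela\tB-PER\tO",
    "2\tMerkel\tI-PER\tO",
    "0\t\t\t",
    "1\tBerlin\tB-LOC\tO",
]

sentences = parse_tsv(SAMPLE)
```

Keeping the source comment on the sentence object, rather than as a separate row type, makes it straightforward to trace each annotated sentence back to its origin. Whether neat handles position `0` rows in exactly this way is not confirmed by the diff above.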