Mirror of https://github.com/qurator-spk/neat.git

	Update Preprocessing.md

Commit e7d0be2288, parent 9dff8a78ba

1 changed file with 3 additions and 3 deletions

@@ -11,11 +11,11 @@ Layout Analysis & Textline Extraction @[sbb_pixelwise_segmentation](https://gith
 ### OCR & Word Segmentation
 
 OCR is based on [OCR-D](https://github.com/OCR-D)'s [ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr), which requires [Tesseract](https://github.com/tesseract-ocr/tesseract) **>= 4.1.0**. The [Fraktur_5000000](https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/) model, which is trained on the [GT4HistOCR](https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR) corpus, is used.
 The [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) produced by [Layout Analysis & Textline Extraction](https://github.com/qurator-spk/neath/blob/master/docs/Preprocessing.md#layout-analysis--textline-extraction) is taken as input; the output is [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) containing the text recognition results, with absolute pixel coordinates describing bounding boxes for words.
 
 ### Tokenization
 
 ### Named Entity Recognition
 
-For Named Entity Recognition, a [BERT-Base](https://github.com/google-research/bert) model was trained. [sbb_ner](https://github.com/qurator-spk/sbb_ner) is using a combination of unsupervised training on a large (~2.3m pages) OCR corpus in combination with supervised training on a small (50k tokens) annotated corpus. Further details are available in the [paper](https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/KONVENS2019_paper_4.pdf).
+For Named Entity Recognition, a [BERT-Base](https://github.com/google-research/bert) model was trained for noisy OCR texts with historical spelling variation.
+
+[sbb_ner](https://github.com/qurator-spk/sbb_ner) combines unsupervised training on a large (~2.3m pages) [corpus of German OCR](https://zenodo.org/record/3257041) from the Digital Collections of the Berlin State Library with supervised training on a small (47k tokens) [annotated corpus](https://github.com/EuropeanaNewspapers/ner-corpora/tree/master/enp_DE.sbb.bio) of OCR from digitized historical newspapers of the Berlin State Library. Further details are available in the [paper](https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/KONVENS2019_paper_4.pdf).
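
The OCR step described in the hunk above emits word-level results with absolute pixel coordinates in PAGE-XML. Below is a minimal Python sketch of reading those word bounding boxes, assuming the `Word`/`Coords`/`TextEquiv` structure of the PAGE schema; the namespace URI (schema version) and the file name `page.xml` are assumptions to adjust for real output.

```python
# Minimal sketch: word-level text and bounding boxes from PAGE-XML.
# ASSUMPTIONS: the namespace URI (schema version) and the input file
# name "page.xml" are placeholders; adjust both for the actual output.
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

tree = ET.parse("page.xml")
for word in tree.iterfind(".//pc:Word", NS):
    # Coords/@points holds space-separated "x,y" polygon vertices
    points = word.find("pc:Coords", NS).get("points")
    xs, ys = zip(*(map(int, p.split(",")) for p in points.split()))
    bbox = (min(xs), min(ys), max(xs), max(ys))  # absolute pixel coordinates
    unicode_el = word.find("pc:TextEquiv/pc:Unicode", NS)
    text = unicode_el.text or "" if unicode_el is not None else ""
    print(bbox, text)
```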
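
The NER paragraphs above describe BERT-based token classification over OCR text. As a purely illustrative sketch (this is not sbb_ner's own interface), the same kind of tagging can be expressed with the HuggingFace `transformers` pipeline; the model id below is a placeholder, not an actual sbb_ner artifact.

```python
# Illustrative sketch of BERT-based NER tagging via HuggingFace
# transformers. NOT sbb_ner's API; the model id is a placeholder.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="some-bert-base-german-ner",  # placeholder model id (assumption)
    aggregation_strategy="simple",      # merge word pieces into entity spans
)

for entity in ner("Die Staatsbibliothek zu Berlin wurde 1661 gegründet."):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```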