Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								85b784f9a1 
								
							 
						 
						
							
							
								
								Fix problem with json encoding  
							
							
							
						 
						
							2021-02-16 11:28:24 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								9e64c4f0d0 
								
							 
						 
						
							
							
								
								Remove obsolete test  
							
							
							
						 
						
							2021-02-16 11:28:24 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								b9259b9d01 
								
							 
						 
						
							
							
								
								Add multiprocessing to flexible_character_accuracy  
							
							
							
						 
						
							2021-02-16 11:28:24 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								c4f75d5264 
								
							 
						 
						
							
							
								
								Increase cache size for bad OCR results.  
							
							
							
						 
						
							2021-02-16 11:28:24 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								84d34f5b26 
								
							 
						 
						
							
							
								
								Fix annoying logging exceptions and encoding errors.  
							
							
							
						 
						
							2021-02-16 11:28:24 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								0dd5fc0ee5 
								
							 
						 
						
							
							
								
								Small corrections  
							
							
							
						 
						
							2021-02-16 11:28:24 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								b24d8d5664 
								
							 
						 
						
							
							
								
								Performance increases  
							
							
							
							
							Temporarily switch to the C implementation of python-levenshtein for
editops calculation. Also added some variables, caching and type
changes for performance gains. 
							
						 
						
							2021-02-16 11:28:24 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								0ef7810dd0 
								
							 
						 
						
							
							
								
								Reduce number of splits for short (one char) elements  
							
							
							
						 
						
							2021-02-16 11:28:24 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								c9219cbacd 
								
							 
						 
						
							
							
								
								Make sure that 0 cer and wer are reported  
							
							
							
						 
						
							2021-02-16 11:28:23 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								fd6f57a263 
								
							 
						 
						
							
							
								
								Fix broken build on Python 3.5  
							
							
							
						 
						
							2021-02-16 11:28:23 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								cac437afbf 
								
							 
						 
						
							
							
								
								Evaluate some performance issues  
							
							
							
						 
						
							2021-02-16 11:28:23 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								1bc7ef6c8b 
								
							 
						 
						
							
							
								
								Correct report for fca  
							
							
							
							
							As the fca implementation already knows the editing operations for each
segment, we use a different sequence alignment method. 
							
						 
						
							2021-02-16 11:28:23 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								750ad00d1b 
								
							 
						 
						
							
							
								
								Add tooltips to fca report  
							
							
							
						 
						
							2021-02-16 11:28:23 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								53064bf833 
								
							 
						 
						
							
							
								
								Include fca as parameter and add some tests  
							
							
							
						 
						
							2021-02-16 11:28:23 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								4a87adc2c7 
								
							 
						 
						
							
							
								
								Implement version specific data structures  
							
							
							
							
							As OCR-D continues to support Python 3.5 until the end of this year,
version-specific data structures have been implemented.
When support for Python 3.5 is dropped, the extra file can easily be
removed. 
							
						 
						
							2021-02-16 11:28:23 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								2a215a1062 
								
							 
						 
						
							
							
								
								Reformat using black  
							
							
							
						 
						
							2021-02-16 11:28:23 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								5277593bdb 
								
							 
						 
						
							
							
								
								Fix some special cases  
							
							
							
						 
						
							2021-02-16 11:28:23 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								d7a74fa58b 
								
							 
						 
						
							
							
								
								First draft of flexible character accuracy  
							
							
							
						 
						
							2021-02-16 11:28:23 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								a68fc269d9 
								
							 
						 
						
							
							
								
								Fix the extraction of text from Page with TableRegion  
							
							
							
							
							Dinglehopper did not consider `OrderedGroupIndex` in the `ReadingOrder`
element when extracting text regions. As a consequence a `TableRegion`
was not considered for text extraction. 
							
						 
						
							2020-11-27 11:18:11 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Konstantin Baierer 
								
							 
						 
						
							
							
							
							
								
							
							
								74e0ac18ed 
								
							 
						 
						
							
							
								
								ocrd cli: use core-provided zip_input_files method  
							
							
							
						 
						
							2020-11-19 16:00:28 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								389e253c11 
								
							 
						 
						
							
							
								
								🐛  dinglehopper: Fix alto_extract_lines()'s type annotation  
							
							
							
						 
						
							2020-11-12 19:32:38 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								fe3923a8af 
								
							 
						 
						
							
							
								
								🐛  dinglehopper: Fix alto_extract()'s type annotation  
							
							
							
						 
						
							2020-11-12 19:19:05 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								132f91d500 
								
							 
						 
						
							
							
								
								✔️  dinglehopper: Add missing integration test markers  
							
							
							
						 
						
							2020-11-12 19:10:23 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								ce752e1912 
								
							 
						 
						
							
							
								
								Remove .idea folder and modify .gitignore  
							
							
							
							
							Sharing even parts of the .idea folder in a worldwide setting is bound to
generate more problems than solutions. Therefore it should be removed
and consequently ignored in .gitignore.
Also adds some Python-specific entries to the .gitignore file. 
							
						 
						
							2020-11-11 11:36:17 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								5270737c1f 
								
							 
						 
						
							
							
								
								Skip test on windows because it is unix specific.  
							
							
							
						 
						
							2020-11-11 11:36:17 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								32a4b95a99 
								
							 
						 
						
							
							
								
								🐛  dinglehopper: Normalize in plain_extract()  
							
							
							
						 
						
							2020-11-10 18:51:14 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								14421c8e53 
								
							 
						 
						
							
							
								
								🎨  dinglehopper: Reformat using black  
							
							
							
						 
						
							2020-11-10 12:29:55 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								31c63f9e4c 
								
							 
						 
						
							
							
								
								🎨  dinglehopper: s/LOG/log  
							
							
							
						 
						
							2020-11-09 16:55:43 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Robert Sachunsky 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								a60c14351e 
								
							 
						 
						
							
							
								
								1 more update for core's getLogger context  
							
							
							
						 
						
							2020-11-03 17:46:59 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								c02569b41e 
								
							 
						 
						
							
							
								
								Fix f-strings for Python 3.5  
							
							
							
						 
						
							2020-10-29 12:33:54 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								7b27b2834e 
								
							 
						 
						
							
							
								
								More complex sorting for text extraction  
							
							
							
							
							When extracting text from TextEquiv nodes we may encounter nodes without
index or nodes that should get sorted via the conf attribute.
Therefore we added a more complex algorithm to extract a TextEquiv and
inform the user via log messages if we encounter structures that we can
handle but may produce unexpected results. 
							
						 
						
							2020-10-29 10:03:40 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Benjamin Rosemann 
								
							 
						 
						
							
							
							
							
								
							
							
								6ff831dfd2 
								
							 
						 
						
							
							
								
								Sort textlines with missing indices  
							
							
							
							
							Python's built-in `sorted` function fails with a TypeError when called on
a list that mixes `None` and integers:
```python
>>> sorted([None, 1])
TypeError: '<' not supported between instances of 'int' and 'NoneType'
```
Therefore we are using `float('inf')` instead of `None` in case of
missing textline indices. 
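The substitution described above can be sketched as follows. This is only an illustration, not the code from the commit: the helper name `line_index` and the sample data are assumptions, and the sketch maps `None` to `float('inf')` through a sort key, so that lines without an index end up after all indexed lines.

```python
def line_index(index):
    """Sort key (hypothetical helper): treat a missing index as +infinity."""
    return float("inf") if index is None else index


# Sample textline indices, two of them missing (assumed data).
indices = [2, None, 0, 1, None]

# Integers compare cleanly with float('inf'), so no TypeError is raised.
ordered = sorted(indices, key=line_index)
print(ordered)
```

Because `sorted` is stable, entries without an index keep their relative order at the end of the result.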
							
						 
						
							2020-10-29 10:03:40 +01:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								5cbe148741 
								
							 
						 
						
							
							
								
								🐛  dinglehopper: Skip pages if there is no GT nor OCR (Fixes GH-34)  
							
							
							
						 
						
							2020-10-21 19:29:45 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								e4e2777cb7 
								
							 
						 
						
							
							
								
								🐛  dinglehopper: Do try to get text when no TextEquivs exist  
							
							
							
						 
						
							2020-10-21 17:59:44 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								1c88891a98 
								
							 
						 
						
							
							
								
								✔️  Add test data for LAREX's indexed TextEquivs (unused)  
							
							
							
						 
						
							2020-10-21 17:51:15 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								19d15e3ecc 
								
							 
						 
						
							
							
								
								🐛  dinglehopper: Honor TextEquiv index (Closes GH-33)  
							
							
							
						 
						
							2020-10-21 17:50:21 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								f626a2ebe6 
								
							 
						 
						
							
							
								
								🧹  dinglehopper: Remove warning when there is a non-TextRegion in the ReadingOrder  
							
							
							
						 
						
							2020-10-21 17:03:55 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								8b4ee20a40 
								
							 
						 
						
							
							
								
								✨  Add a new CLI tool dinglehopper-extract to just give the extracted text  
							
							
							
						 
						
							2020-10-21 16:30:48 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								b23b75b601 
								
							 
						 
						
							
							
								
								✨  dinglehopper: Give segment ids from the extracted textequiv_level  
							
							
							
						 
						
							2020-10-21 16:04:33 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								b23e4ce30e 
								
							 
						 
						
							
							
								
								✨  dinglehopper: Add OCR-D parameter to choose TextEquiv level  
							
							
							
						 
						
							2020-10-21 14:38:19 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								9744fa2567 
								
							 
						 
						
							
							
								
								✨  dinglehopper: Add CLI option to choose TextEquiv level  
							
							
							
						 
						
							2020-10-20 19:33:39 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								75733039b8 
								
							 
						 
						
							
							
								
								🧹  dinglehopper: Do not hardcode joiner to \n  
							
							
							
						 
						
							2020-10-20 18:43:56 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								3848412349 
								
							 
						 
						
							
							
								
								✨  dinglehopper: Implement the basic text extraction from PAGE TextLines  
							
							
							
						 
						
							2020-10-20 18:40:21 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								f2367ac0c3 
								
							 
						 
						
							
							
								
								🐛  Fix OCR-D CLI for newest OCR-D  
							
							
							
							
							Now that find_files() is a generator, we can't use [0] to get the file. 
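A minimal sketch of why subscripting stops working once a function returns a generator. `find_files_stub` is a hypothetical stand-in and not the OCR-D API; the file names are made up, and `next()` is just one common way to take the first item, not necessarily what the commit does.

```python
def find_files_stub(names):
    """Hypothetical stand-in for a lookup that yields results lazily."""
    for name in names:
        yield name


# Generators do not support indexing, so [0] raises a TypeError.
try:
    find_files_stub(["a.xml", "b.xml"])[0]
    subscript_error = None
except TypeError as exc:
    subscript_error = exc

# next() takes the first item; the default avoids StopIteration when empty.
first = next(find_files_stub(["a.xml", "b.xml"]), None)
missing = next(find_files_stub([]), None)
print(first, missing)
```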
							
						 
						
							2020-10-16 14:58:27 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								5ed184c8c4 
								
							 
						 
						
							
							
								
								✨  dinglehopper: Show a progressbar on --progress  
							
							
							
						 
						
							2020-10-15 16:09:54 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								4951823a29 
								
							 
						 
						
							
							
								
								🧹  dinglehopper: Disable metrics in JSON report, too  
							
							
							
						 
						
							2020-10-15 15:38:15 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								82217a25bb 
								
							 
						 
						
							
							
								
								🧹  dinglehopper: Move all normalization code to extracted_text.py  
							
							
							
						 
						
							2020-10-08 17:29:25 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								c6c6b8efab 
								
							 
						 
						
							
							
								
								📝  dinglehopper: Add detail about the text extraction and ExtractedText  
							
							
							
						 
						
							2020-10-08 17:05:36 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								f50591abac 
								
							 
						 
						
							
							
								
								Merge branch 'feat/display-segment-id'  
							
							
							
						 
						
							2020-10-08 13:39:38 +02:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
						 
						
							
							
							
							
								
							
							
								c514abfb9f 
								
							 
						 
						
							
							
								
								🧹  dinglehopper: Sanitize imports  
							
							
							
						 
						
							2020-10-08 13:33:19 +02:00