DocumentCode :
3020536
Title :
The same is not the same - postcorrection of alphabet confusion errors in mixed-alphabet OCR recognition
Author :
Ringlstetter, Christoph ; Schulz, Klaus U. ; Mihov, Stoyan ; Louka, Katerina
fYear :
2005
fDate :
29 Aug.-1 Sept. 2005
Firstpage :
406
Abstract :
Character sets for Eastern European languages typically contain symbols that are optically almost or fully identical to Latin letters. When scanning documents with mixed Cyrillic-Latin or Greek-Latin alphabets, even high-quality OCR software is often unable to distinguish correctly between Cyrillic (Greek) and Latin symbols. This effect leads to an error rate far beyond the usual error rates observed when recognizing single-alphabet documents. In this paper we first survey similarities between Latin and Cyrillic (Greek) letters and words for distinct languages and fonts. After briefly introducing a new public corpus collected by our groups for evaluating OCR technology on mixed-alphabet documents, we describe how to adapt general algorithms and tools for postcorrection of OCR results to the new context of mixed-alphabet recognition. Experimental results on Bulgarian documents from the corpus and from other sources demonstrate that a drastic reduction of error rates can be achieved.
Keywords :
character sets; document image processing; error statistics; image scanners; optical character recognition; alphabet confusion error statistics; document scanning; mixed-alphabet OCR recognition; postcorrection method; Character recognition; Computational Intelligence Society; Ear; Error analysis; Optical character recognition software; Text analysis; Text processing
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on
ISSN :
1520-5263
Print_ISBN :
0-7695-2420-6
Type :
conf
DOI :
10.1109/ICDAR.2005.240
Filename :
1575578