Regional Information Center for Science and Technology - The same is not the same - postcorrection of alphabet confusion errors in mixed-alphabet OCR recognition

DocumentCode :

3020536

Title :

The same is not the same - postcorrection of alphabet confusion errors in mixed-alphabet OCR recognition

Author :

Ringlstetter, Christoph ; Schulz, Klaus U. ; Mihov, Stoyan ; Louka, Katerina

fYear :

2005

fDate :

29 Aug.-1 Sept. 2005

Firstpage :

406

Abstract :

Character sets for Eastern European languages typically contain symbols that are optically almost or fully identical to Latin letters. When scanning documents with mixed Cyrillic-Latin or Greek-Latin alphabets, even high-quality OCR software often cannot correctly distinguish Cyrillic (Greek) symbols from Latin ones. This effect leads to an error rate far higher than the rates usually observed when recognizing single-alphabet documents. In this paper we first survey similarities between Latin and Cyrillic (Greek) letters and words for distinct languages and fonts. After briefly introducing a new, public corpus collected by our groups for evaluating OCR technology on mixed-alphabet documents, we describe how to adapt general algorithms and tools for postcorrection of OCR results to the new context of mixed-alphabet recognition. Experimental results on Bulgarian documents from the corpus and from other sources demonstrate that a drastic reduction of error rates can be achieved.
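The kind of alphabet confusion described in the abstract can be illustrated with a minimal Python sketch. This is not the postcorrection method of the paper, which the record does not describe in detail. The look-alike table, the function names `is_mixed` and `unify`, and the majority-alphabet heuristic are all assumptions made here for illustration only.

```python
# Hypothetical illustration only; NOT the method of the paper.
# A small, incomplete table of Cyrillic letters that look like Latin
# letters in many fonts (an assumption chosen for this sketch).
CYR_TO_LAT = {
    "\u0430": "a", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0441": "c", "\u0445": "x",
    "\u0410": "A", "\u0415": "E", "\u041e": "O",
    "\u0420": "P", "\u0421": "C", "\u0425": "X",
}
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}


def is_cyrillic(ch):
    """True if ch lies in the basic Unicode Cyrillic block."""
    return "\u0400" <= ch <= "\u04ff"


def is_latin(ch):
    """True if ch is an ASCII letter."""
    return ch.isascii() and ch.isalpha()


def is_mixed(token):
    """True if the token contains both Cyrillic and Latin letters."""
    return any(is_cyrillic(c) for c in token) and any(is_latin(c) for c in token)


def unify(token):
    """Rewrite a mixed token into its majority alphabet.

    The token is returned unchanged if it is not mixed, or if some
    minority-alphabet letter has no look-alike in the table.
    Ties are resolved in favour of Cyrillic (an arbitrary choice).
    """
    cyr = sum(is_cyrillic(c) for c in token)
    lat = sum(is_latin(c) for c in token)
    if cyr == 0 or lat == 0:
        return token
    if cyr >= lat:
        table, minority = LAT_TO_CYR, is_latin
    else:
        table, minority = CYR_TO_LAT, is_cyrillic
    if any(minority(c) and c not in table for c in token):
        return token
    return "".join(table[c] if minority(c) else c for c in token)


if __name__ == "__main__":
    # A Bulgarian word in which OCR emitted a Latin "o".
    noisy = "\u0441o\u0444\u0438\u044f"
    print(is_mixed(noisy), unify(noisy) == "\u0441\u043e\u0444\u0438\u044f")
```

A purely character-level heuristic like this cannot decide tokens that consist only of look-alike letters, which is one reason a postcorrection step would typically also consult word lists for each language.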

Keywords :

character sets; document image processing; error statistics; image scanners; optical character recognition; alphabet confusion error statistics; document scanning; mixed-alphabet OCR recognition; postcorrection method; Character recognition; Computational Intelligence Society; Ear; Error analysis; Optical character recognition software; Text analysis; Text processing

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on

ISSN :

1520-5263

Print_ISBN :

0-7695-2420-6

Type :

conf

DOI :

10.1109/ICDAR.2005.240

Filename :

1575578

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3020536