Title :
Improved degraded document recognition with hybrid modeling techniques and character n-grams
Author :
Brakensiek, Anja ; Willett, Daniel ; Rigoll, Gerhard
Author_Institution :
Dept. of Comput. Sci., Gerhard-Mercator-Univ. Duisburg, Germany
Abstract :
A robust multifont character recognition system for degraded documents, such as photocopies or faxes, is described. The system is based on hidden Markov models using discrete and hybrid modeling techniques, where the latter uses an information theory-based neural network. The presented recognition results refer to the SEDAL-database of English documents, and no dictionary is used. It is also demonstrated that a language model consisting of character n-grams yields significantly better recognition results. The resulting system clearly outperforms commercial systems and achieves further error rate reductions compared to previous results reported on this database.
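To illustrate the character n-gram idea mentioned in the abstract, here is a minimal sketch of a character bigram language model. It is not the authors' implementation: the bigram order, the add-one smoothing, the `^` start marker, the function names, and the toy training string are all assumptions chosen for brevity. In a recognizer of the kind described, such scores would presumably be combined with the HMM scores during decoding, a step this sketch does not attempt.

```python
import math
from collections import Counter


def train_char_bigram(text):
    """Count character bigrams and their left contexts in the training text."""
    padded = "^" + text  # "^" is a hypothetical start-of-text marker
    bigrams = Counter(zip(padded, padded[1:]))
    contexts = Counter(padded[:-1])
    vocab = set(padded)
    return bigrams, contexts, vocab


def log_prob(candidate, model):
    """Log-probability of a character string under the bigram model.

    Add-one (Laplace) smoothing is an assumption made here so that
    unseen bigrams get a small non-zero probability.
    """
    bigrams, contexts, vocab = model
    padded = "^" + candidate
    v = len(vocab)
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + v))
    return total


# Toy training text; a real model would be estimated from a large corpus.
model = train_char_bigram("the system is based on hidden markov models")

# A plausible reading should score higher than a typical OCR confusion.
good = log_prob("the model", model)
bad = log_prob("tbe rnodel", model)
```

With this toy model, `good` is larger than `bad`, which is the kind of preference that lets a language model correct character confusions without a word dictionary.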
Keywords :
database management systems; document image processing; feature extraction; hidden Markov models; information theory; neural nets; optical character recognition; SEDAL-database; degraded document recognition; multifont character recognition; neural network; Character recognition; Computer science; Databases; Degradation; Error analysis; Image recognition; Optical character recognition software; Robustness; Testing
Conference_Title :
Proceedings of the 15th International Conference on Pattern Recognition, 2000
Conference_Location :
Barcelona
Print_ISBN :
0-7695-0750-6
DOI :
10.1109/ICPR.2000.902952