Title :
Context-Sensitive Error Correction: Using Topic Models to Improve OCR
Author :
Wick, Michael L. ; Ross, Michael G. ; Learned-Miller, Erik G.
Author_Institution :
Univ. of Massachusetts Amherst, Amherst
Abstract :
Modern optical character recognition software relies on human interaction to correct misrecognized characters. Even though the software often reliably identifies low-confidence output, the simple language and vocabulary models employed are insufficient to automatically correct mistakes. This paper demonstrates that topic models, which automatically detect and represent an article's semantic context, reduce error by 7% over a global word distribution in a simulated OCR correction task. Detecting and leveraging context in this manner is an important step towards improving OCR.
Keywords :
optical character recognition; OCR; context-sensitive error correction; global word distribution; human interaction; optical character recognition software; recognized characters; topic models; Character recognition; Context modeling; Error correction; Frequency; Hidden Markov models; Humans; Linear discriminant analysis; Optical character recognition software; Tongue; Vocabulary;
Conference_Title :
Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on
Conference_Location :
Parana
Print_ISBN :
978-0-7695-2822-9
DOI :
10.1109/ICDAR.2007.4377099