Title :
Omni font OCR error correction with effect on retrieval
Author :
Magdy, Walid ; Darwish, Kareem
Author_Institution :
Sch. of Comput., Dublin City Univ., Dublin, Ireland
fDate :
Nov. 29 2010-Dec. 1 2010
Abstract :
Recent library digitization projects attempt to provide large collections of printed material from varying sources in a searchable format. The scanned documents are typically processed using Optical Character Recognition (OCR), which introduces errors in the text. This paper proposes a technique for correcting OCR-degraded text that is independent of character-level OCR errors, and hence independent of the scanned document source. It is based on language modeling in conjunction with a uniform character model that uses edit distance only. The technique compares well to state-of-the-art correction techniques that are based on language modeling and source-specific character error models. Although the proposed technique yielded lower correction effectiveness, its impact on retrieval effectiveness is statistically significant and on par with state-of-the-art correction techniques. The main requirement of the proposed technique is the training of a “good” language model matching genre, style, and temporal coverage. The advantage of being independent of character-level errors is clear in applications where printed documents vary in source, font, and degradation level.
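The combination of a language model with a uniform, edit-distance-only character model can be illustrated with a minimal sketch. This is not the authors' implementation: the unigram language model, the function names `edit_distance` and `correct`, and the parameters `max_distance` and `penalty` are all assumptions chosen for illustration, and the abstract does not specify the order of the language model or how the two scores are combined.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; every insert, delete, or substitute costs 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct(word: str, unigram_counts: Counter,
            max_distance: int = 2, penalty: float = 3.0) -> str:
    """Pick the vocabulary word that maximizes
    log P_lm(candidate) - penalty * edit_distance(word, candidate).

    The character model is "uniform" in the sense that all character
    edits are penalized equally, so nothing is learned from a specific
    OCR engine, font, or scan source. Both `max_distance` and `penalty`
    are hypothetical tuning knobs, not values from the paper.
    """
    total = sum(unigram_counts.values())
    best, best_score = word, -math.inf
    for candidate, count in unigram_counts.items():
        distance = edit_distance(word, candidate)
        if distance > max_distance:
            continue
        score = math.log(count / total) - penalty * distance
        if score > best_score:
            best, best_score = candidate, score
    # If no candidate is close enough, the word is returned unchanged.
    return best


# Toy language model: word counts from a hypothetical clean corpus.
counts = Counter({"retrieval": 50, "retrieved": 5, "character": 40})
corrected = correct("retrleval", counts)
```

In this sketch the only source-dependent component is the word-count table, which mirrors the abstract's point that the main requirement is a language model matching the genre, style, and temporal coverage of the collection.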
Keywords :
document image processing; information retrieval; optical character recognition; text analysis; library digitization projects; omni font OCR error correction; retrieval effect; uniform character model; Arabic Text; Error Correction; Information Retrieval; Language Modeling; OCR;
Conference_Titel :
Intelligent Systems Design and Applications (ISDA), 2010 10th International Conference on
Conference_Location :
Cairo
Print_ISBN :
978-1-4244-8134-7
DOI :
10.1109/ISDA.2010.5687228