Regional Information Center for Science and Technology - Omni font OCR error correction with effect on retrieval

DocumentCode :

2064219

Title :

Omni font OCR error correction with effect on retrieval

Author :

Magdy, Walid ; Darwish, Kareem

Author_Institution :

Sch. of Comput., Dublin City Univ., Dublin, Ireland

fYear :

2010

fDate :

Nov. 29 2010-Dec. 1 2010

Firstpage :

415

Lastpage :

420

Abstract :

Recent library digitization projects attempt to provide large collections of printed material from varying sources in a searchable format. The scanned documents are typically processed using Optical Character Recognition (OCR), which often introduces errors in the text. This paper proposes a technique for correcting OCR-degraded text that is independent of character-level OCR errors, and hence independent of the scanned document source. It is based on language modeling in conjunction with a uniform character model that uses edit distance only. The technique compares well to state-of-the-art correction techniques that are based on language modeling and source-specific character error models. Although the proposed technique yielded lower correction effectiveness, its impact on retrieval effectiveness is statistically significant and on par with state-of-the-art correction techniques. The main requirement of the proposed technique is the training of a “good” language model matching genre, style, and temporal coverage. The advantage of being independent of character-level errors is clear in applications where printed documents vary in source, font, and degradation level.
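The idea of pairing a language model with a uniform, edit-distance-only character model can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the model details, so the unigram language model, the fixed per-edit penalty, the `max_edits` cutoff, and the names `edit_distance`, `correct`, and `counts` are all assumptions made for illustration. Each candidate word is scored by its log probability in the language model plus a penalty that depends only on the number of edits, not on which characters were confused.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct(word, unigram_counts, max_edits=2, edit_log_penalty=math.log(1e-3)):
    """Return the best-scoring vocabulary word within max_edits of `word`.

    Assumed scoring (illustrative only):
        log P_lm(candidate) + edits * edit_log_penalty
    The penalty is the same for every character edit, so no
    source-specific confusion statistics are needed.
    """
    total = sum(unigram_counts.values())
    best, best_score = word, float("-inf")
    for cand, count in unigram_counts.items():
        d = edit_distance(word, cand)
        if d > max_edits:
            continue
        score = math.log(count / total) + d * edit_log_penalty
        if score > best_score:
            best, best_score = cand, score
    return best  # the input is returned unchanged if no candidate is close enough


# Toy "language model": word counts from a tiny hypothetical corpus.
counts = Counter(
    "the library scanned the documents and the library indexed the documents".split()
)

print(correct("librarv", counts))     # one substitution away from "library"
print(correct("docurnents", counts))  # "m" misread as "rn": two edits
```

Because every edit costs the same, nothing here has to be retrained for a new font or scanner; only the language model needs to match the collection. A source-specific character error model would instead weight each confusion (such as "m" versus "rn") by how often it occurs for that particular source.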

Keywords :

document image processing; information retrieval; optical character recognition; text analysis; library digitization projects; omni font OCR error correction; optical character recognition; retrieval effect; uniform character model; Arabic Text; Error Correction; Information Retrieval; Language Modeling; OCR;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Intelligent Systems Design and Applications (ISDA), 2010 10th International Conference on

Conference_Location :

Cairo

Print_ISBN :

978-1-4244-8134-7

Type :

conf

DOI :

10.1109/ISDA.2010.5687228

Filename :

5687228

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2064219