• DocumentCode
    2825464
  • Title

    Information Retrieval Based on OCR Errors in Scanned Documents

  • Author

    Fataicha, Y. ; Cheriet, M. ; Nie, J.Y. ; Suen, C.Y.

  • Author_Institution
    École de Technologie Supérieure de Montréal, Québec, Canada; Université de Montréal, Québec, Canada
  • Volume
    3
  • fYear
    2003
  • fDate
    16-22 June 2003
  • Firstpage
    25
  • Lastpage
    25
  • Abstract
    A significant proportion of documents are document images, i.e. scanned documents. Retrieving them requires recognizing their contents. Current technologies for optical character recognition (OCR) and document analysis do not handle such documents adequately because of recognition errors. In this paper, we describe an approach that detects errors in scanned texts without relying on a lexicon and integrates this detection into the search process. The proposed algorithm consists of two basic steps. In the first step, we apply editing operations to OCR words to generate a collection of error-grams and correction rules. The second step uses query terms, error-grams, and correction rules to create searchable keywords, identify appropriate matching terms, and determine the degree of relevance of retrieved document images. The algorithm has been tested on 979 document images provided by Media-team databases from Washington University, and the experimental results show the effectiveness of our method and indicate an improvement over standard methods such as exact or partial matching, N-gram overlaps, and Q-gram distance.
  • Keywords
    Character recognition; Computer errors; Content based retrieval; Error correction; Image databases; Image retrieval; Information retrieval; Optical character recognition software; Testing; Text analysis; Image document; Matching; N-gram statistics; OCR; String; confusion probability; information retrieval; query term expansion; text processing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Vision and Pattern Recognition Workshop, 2003. CVPRW '03. Conference on
  • Conference_Location
    Madison, Wisconsin, USA
  • ISSN
    1063-6919
  • Print_ISBN
    0-7695-1900-8
  • Type

    conf

  • DOI
    10.1109/CVPRW.2003.10020
  • Filename
    4624283