• DocumentCode
    3060402
  • Title

    Probabilistic Approach for Correction of Optically-Character-Recognized Strings Using Suffix Tree

  • Author

    Jain, Rupi ; Chaudhury, Santanu

  • Author_Institution
    Dept. of EE, IIT Delhi, Delhi, India
  • fYear
    2011
  • fDate
    15-17 Dec. 2011
  • Firstpage
    74
  • Lastpage
    77
  • Abstract
    In this paper we present an approach for correcting the character recognition errors of an OCR system that recognises Indic scripts. A suffix tree indexes the lexicon in lexicographical order to facilitate probabilistic search. To find the most probable match for a misrecognised string, the string is compared with substrings of the lexicon (the edges of the suffix tree) using a weighted Levenshtein distance as the similarity measure, in which the confusion probabilities of characters (Unicode code points) serve as substitution costs; the comparison stops once the accumulated cost exceeds a specified cost k. Retrieved candidates are sorted by edit cost, and those with the lowest cost are selected. Using this information, the system corrects non-word errors and achieves a maximum error rate reduction of 33% over a simple character recognition system.
  • Keywords
    optical character recognition; probability; search problems; character recognition errors; character recognition system; confusion probabilities; optically character recognized strings; probabilistic search; suffix tree; weighted Levenshtein distance; Accuracy; Character recognition; Dictionaries; Error correction; Optical character recognition software; Probabilistic logic; Training; OCR error correction; document; probabilistic error correction; suffix tree; weighted levenshtein edit distance;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2011 Third National Conference on
  • Conference_Location
    Hubli, Karnataka
  • Print_ISBN
    978-1-4577-2102-1
  • Type

    conf

  • DOI
    10.1109/NCVPRIPG.2011.24
  • Filename
    6133004
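
The matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it makes several assumptions: the lexicon is scanned linearly rather than indexed with a suffix tree, the substitution cost is taken to be 1 minus the confusion probability (the abstract does not give the exact mapping), insertions and deletions cost 1, and the confusion table, the Latin-script example words, and the function names are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical confusion table: P(OCR outputs `seen` | true character is `truth`),
# keyed as (truth, seen). Values are illustrative, not taken from the paper.
CONFUSION: Dict[Tuple[str, str], float] = {
    ("l", "1"): 0.9,
    ("o", "0"): 0.8,
}


def sub_cost(seen: str, truth: str) -> float:
    """Substitution cost, assumed here to be 1 - confusion probability."""
    if seen == truth:
        return 0.0
    return 1.0 - CONFUSION.get((truth, seen), 0.0)


def weighted_distance(ocr: str, word: str, k: float) -> float:
    """Weighted Levenshtein cost of turning `ocr` into `word`.

    Returns math.inf as soon as the cost is known to exceed the cutoff k.
    """
    prev = [float(j) for j in range(len(word) + 1)]
    for i, a in enumerate(ocr, 1):
        cur = [float(i)]
        for j, b in enumerate(word, 1):
            cur.append(min(
                prev[j] + 1.0,               # delete a from the OCR string
                cur[j - 1] + 1.0,            # insert b into the OCR string
                prev[j - 1] + sub_cost(a, b) # substitute (or match)
            ))
        # All costs are non-negative, so the row minimum never decreases:
        # once it passes k, no alignment can come back under the cutoff.
        if min(cur) > k:
            return math.inf
        prev = cur
    return prev[-1] if prev[-1] <= k else math.inf


def correct(ocr: str, lexicon: List[str], k: float) -> List[Tuple[float, str]]:
    """Return (cost, word) candidates within cost k, lowest cost first."""
    scored = [(weighted_distance(ocr, w, k), w) for w in lexicon]
    return sorted((c, w) for c, w in scored if c != math.inf)


if __name__ == "__main__":
    print(correct("he1lo", ["hello", "help", "hollow"], k=1.5))
```

A suffix tree (or a trie) would let candidates that share a prefix reuse the same rows of the cost table and would prune whole subtrees at the cutoff; the linear scan above only keeps the cost model and the k cutoff.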