• DocumentCode
    2145510
  • Title
    Error Correction with In-domain Training across Multiple OCR System Outputs
  • Author
    Lund, William B.; Ringger, Eric K.
  • Author_Institution
    Comput. Sci. Dept., Brigham Young Univ., Provo, UT, USA
  • fYear
    2011
  • fDate
    18-21 Sept. 2011
  • Firstpage
    658
  • Lastpage
    662
  • Abstract
    Optical character recognition (OCR) systems differ in the types of errors they make, particularly when recognizing characters from degraded or poor-quality documents. Correcting these OCR errors is the first step toward more effective use of the documents in digital libraries. This paper demonstrates the degree to which the word error rate (WER) can be reduced by applying a decision list to a combination of textual features across the aligned output of multiple OCR engines when in-domain training data is available. The research was performed on a data set for which the mean WER across the three OCR engines employed is 33.5% and the lattice word error rate is 13.0%. Our correction method leads to a 52.2% relative decrease in the mean WER and a 19.5% relative improvement over the best single OCR engine, as well as an improvement over our previous work. Further, our method yields instances where the document WER approaches and, for five documents, matches the lattice word error rate, which is a theoretical lower bound given the evidence found in the OCR output.
  • Keywords
    document image processing; error correction; optical character recognition; OCR engines; digital libraries; in-domain training; lattice word error rate; multiple OCR system outputs; word error rate; Dictionaries; Engines; Error analysis; Lattices; Optical character recognition software; Speech recognition; Training; Decision lists; Multiple OCR engines; OCR error correction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2011 International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4577-1350-7
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2011.138
  • Filename
    6065393