DocumentCode
2145510
Title
Error Correction with In-domain Training across Multiple OCR System Outputs
Author
Lund, William B. ; Ringger, Eric K.
Author_Institution
Comput. Sci. Dept., Brigham Young Univ., Provo, UT, USA
fYear
2011
fDate
18-21 Sept. 2011
Firstpage
658
Lastpage
662
Abstract
Optical character recognition (OCR) systems differ in the types of errors they make, particularly in recognizing characters from degraded or poor-quality documents. The problem is how to correct these OCR errors, which is the first step toward more effective use of the documents in digital libraries. This paper demonstrates the degree to which the word error rate (WER) can be reduced using a decision list on a combination of textual features across the aligned output of multiple OCR engines where in-domain training data is available. This research was performed on a data set for which the mean WER across the three OCR engines employed is 33.5% and the lattice word error rate is 13.0%. Our correction method leads to a 52.2% relative decrease in the mean WER and a 19.5% relative improvement over the best single OCR engine, as well as an improvement over our previous work. Further, our method yields instances where the document WER approaches and, for five documents, matches the lattice word error rate, which is a theoretical lower bound given the evidence found in the OCR.
Keywords
document image processing; error correction; optical character recognition; OCR engines; digital libraries; in-domain training; lattice word error rate; multiple OCR system outputs; word error rate; Dictionaries; Engines; Error analysis; Lattices; Optical character recognition software; Speech recognition; Training; Decision lists; Multiple OCR engines; OCR error correction
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location
Beijing
ISSN
1520-5363
Print_ISBN
978-1-4577-1350-7
Electronic_ISBN
1520-5363
Type
conf
DOI
10.1109/ICDAR.2011.138
Filename
6065393