Title :
Error Correction with In-domain Training across Multiple OCR System Outputs
Author :
Lund, William B. ; Ringger, Eric K.
Author_Institution :
Comput. Sci. Dept., Brigham Young Univ., Provo, UT, USA
Abstract :
Optical character recognition (OCR) systems differ in the types of errors they make, particularly when recognizing characters in degraded or poor-quality documents. The problem is how to correct these OCR errors, which is the first step toward more effective use of the documents in digital libraries. This paper demonstrates the degree to which the word error rate (WER) can be reduced using a decision list on a combination of textual features across the aligned output of multiple OCR engines where in-domain training data is available. This research was performed on a data set for which the mean WER across the three OCR engines employed is 33.5% and the lattice word error rate is 13.0%. Our correction method leads to a 52.2% relative decrease in the mean WER and a 19.5% relative improvement over the best single OCR engine, as well as an improvement over our previous work. Further, our method yields instances where the document WER approaches, and for five documents matches, the lattice word error rate, which is a theoretical lower bound given the evidence found in the OCR.
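The figures in the abstract rest on two pieces of arithmetic: the word error rate itself and a relative reduction of it. The sketch below illustrates both under stated assumptions; it is not the authors' evaluation code. It assumes WER is the word-level edit distance divided by the number of reference words, and the function names (`word_error_rate`, `wer_after_relative_reduction`) are hypothetical. The resulting mean WER of roughly 16.0% is implied by the reported 33.5% and 52.2%, not a value stated in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization and unit costs for substitution,
    insertion, and deletion. The paper may tokenize or weight differently.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def wer_after_relative_reduction(baseline_wer: float, relative_decrease: float) -> float:
    """WER that remains after a relative decrease from a baseline WER."""
    return baseline_wer * (1.0 - relative_decrease)


if __name__ == "__main__":
    # Illustrative strings, not data from the paper.
    print(word_error_rate("the quick brown fox", "the qu1ck brown"))
    # Mean WER implied by the abstract's reported figures.
    print(round(wer_after_relative_reduction(0.335, 0.522), 4))
```

Read this way, a 52.2% relative decrease from a 33.5% mean WER leaves a mean WER of about 16.0%, which is still above the 13.0% lattice word error rate that the abstract describes as the lower bound.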
Keywords :
document image processing; error correction; optical character recognition; OCR engines; digital libraries; in-domain training; lattice word error rate; multiple OCR system outputs; word error rate; Dictionaries; Engines; Error analysis; Lattices; Optical character recognition software; Speech recognition; Training; Decision lists; Multiple OCR engines; OCR error correction
Conference_Title :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4577-1350-7
ISSN :
1520-5363
DOI :
10.1109/ICDAR.2011.138