DocumentCode :
2145510
Title :
Error Correction with In-domain Training across Multiple OCR System Outputs
Author :
Lund, William B. ; Ringger, Eric K.
Author_Institution :
Comput. Sci. Dept., Brigham Young Univ., Provo, UT, USA
fYear :
2011
fDate :
18-21 Sept. 2011
Firstpage :
658
Lastpage :
662
Abstract :
Optical character recognition (OCR) systems differ in the types of errors they make, particularly in recognizing characters from degraded or poor-quality documents. The problem is how to correct these OCR errors, which is the first step toward more effective use of the documents in digital libraries. This paper demonstrates the degree to which the word error rate (WER) can be reduced using a decision list on a combination of textual features across the aligned output of multiple OCR engines when in-domain training data is available. This research was performed on a data set for which the mean WER across the three OCR engines employed is 33.5%, and the lattice word error rate is 13.0%. Our correction method leads to a 52.2% relative decrease in the mean WER and a 19.5% relative improvement over the best single OCR engine, as well as an improvement over our previous work. Further, our method yields instances where the document WER approaches, and for five documents matches, the lattice word error rate, which is a theoretical lower bound given the evidence found in the OCR.
Keywords :
document image processing; error correction; optical character recognition; OCR engines; digital libraries; in-domain training; lattice word error rate; multiple OCR system outputs; word error rate; Dictionaries; Engines; Error analysis; Lattices; Optical character recognition software; Speech recognition; Training; Decision lists; Multiple OCR engines; OCR error correction
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
ISSN :
1520-5363
Print_ISBN :
978-1-4577-1350-7
Electronic_ISBN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2011.138
Filename :
6065393