DocumentCode :
153340
Title :
Context-Dependent Confusions Rules for Building Error Model Using Weighted Finite State Transducers for OCR Post-Processing
Author :
Al Azawi, Mayce ; Breuel, Thomas M.
Author_Institution :
Univ. of Kaiserslautern, Kaiserslautern, Germany
fYear :
2014
fDate :
7-10 April 2014
Firstpage :
116
Lastpage :
120
Abstract :
In this paper, we propose a new technique to correct OCR errors by means of weighted finite state transducers (WFST) with context-dependent confusion rules. We translate the OCR confusions that appear in the recognition outputs into edit operations (insertions, deletions, and substitutions) using the Levenshtein edit distance algorithm. The edit operations are extracted in the form of rules with respect to the context of the incorrect string to build an error model using weighted finite state transducers. The context-dependent rules help to apply each rule to the appropriate strings. Our new error model avoids the calculations that occur in searching the language model, and it also enables the language model to correct incorrect words by using context-dependent confusion rules. Our approach is language independent. It is designed to deal with varying numbers of errors and places no limit on word size. In a set of experiments conducted on OCRed pages from the UWIII dataset, our proposed error model outperforms both the baseline and the existing state-of-the-art approach. The evaluation shows that the error rate of our model on the UWIII test set is 0.68%, while the baseline is 1.14% and the error rate of the existing state-of-the-art single-character rule-based approach is 1.0%.
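The rule-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `align` and `context_rules`, the `width` parameter, and the rule layout `(left context, OCR text, correct text, right context)` are assumptions made for illustration only, and the WFST construction and weighting are omitted.

```python
from collections import Counter


def align(ocr, truth):
    """Levenshtein alignment of two strings.

    Returns a list of (ocr_char, truth_char) pairs; an empty string
    marks a gap (a character missing on that side).
    """
    n, m = len(ocr), len(truth)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ocr[i - 1] == truth[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match or substitution
                          d[i - 1][j] + 1,         # extra character in OCR output
                          d[i][j - 1] + 1)         # character missing from OCR output
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (
                0 if ocr[i - 1] == truth[j - 1] else 1):
            pairs.append((ocr[i - 1], truth[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((ocr[i - 1], ""))
            i -= 1
        else:
            pairs.append(("", truth[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def context_rules(ocr, truth, width=1):
    """Extract (left, ocr_text, truth_text, right) rules for each edit.

    `width` (a hypothetical parameter) is the number of aligned
    positions of OCR context kept on each side of the edit.
    """
    pairs = align(ocr, truth)
    rules = []
    for k, (o, t) in enumerate(pairs):
        if o == t:
            continue  # no confusion at this position
        left = "".join(p[0] for p in pairs[max(0, k - width):k])
        right = "".join(p[0] for p in pairs[k + 1:k + 1 + width])
        rules.append((left, o, t, right))
    return rules


# Counting rules over aligned (OCR, ground truth) pairs gives frequencies
# that could later be turned into transducer weights.
rule_counts = Counter()
for ocr_word, true_word in [("tbe", "the"), ("tbe", "the"), ("helo", "hello")]:
    rule_counts.update(context_rules(ocr_word, true_word))
```

Here the substitution of `b` for `h` in `tbe` is recorded together with its neighbours `t` and `e`, so the resulting rule applies only where that context occurs. How context is chosen and how counts map to weights in the paper's error model is not specified in this record.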
Keywords :
error correction; finite state machines; optical character recognition; text editing; Levenshtein edit distance algorithm; OCR confusions; OCR error correction; OCR post-processing; UWIII dataset; WFST; context-dependent confusion rules; error model; language independent approach; language model; single character rule-based approach; weighted finite state transducers; Automata; Computational modeling; Context; Context modeling; Dictionaries; Optical character recognition software; Transducers; Context-Dependent Rules; Error Model; Language Model; OCR; WFST;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on
Conference_Location :
Tours
Print_ISBN :
978-1-4799-3243-6
Type :
conf
DOI :
10.1109/DAS.2014.75
Filename :
6830981