Title :
Word discrimination based on bigram co-occurrences
Author :
El-Nasan, Adnan ; Veeramachaneni, Sriharsha ; Nagy, George
Author_Institution :
DocLab, Rensselaer Polytech. Inst., Troy, NY, USA
fDate :
2001
Abstract :
Very few pairs of English words share exactly the same letter bigrams. This linguistic property can be exploited to bring lexical context into the classification stage of a word recognition system. The lexical n-gram matches between every word in a lexicon and a subset of reference words can be precomputed. If a match function can detect matching segments of at least n-gram length from the feature representation of words, then an unknown word can be recognized by determining the subset of reference words that have an n-gram match with it at the feature level. We show that, with a reasonable number of reference words, bigrams represent the best compromise between the recall ability of single letters and the precision of trigrams. Our simulations indicate that using a longer reference list can compensate for errors in feature extraction. The algorithm is fast enough for human-computer interaction, even on a slow processor.
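The lookup scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`bigrams`, `signature`, `build_index`, `recognize`) and the toy lexicon and reference list are hypothetical, and the feature-level match function is assumed to be error-free, so the set of matching reference words is simply supplied as a set of indices.

```python
from collections import defaultdict


def bigrams(word):
    """Return the set of adjacent letter pairs in a word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def signature(word, references):
    """Indices of the reference words that share at least one bigram with `word`."""
    grams = bigrams(word)
    return frozenset(i for i, ref in enumerate(references) if grams & bigrams(ref))


def build_index(lexicon, references):
    """Precompute the lexical signature of every lexicon word (done once, offline)."""
    index = defaultdict(list)
    for word in lexicon:
        index[signature(word, references)].append(word)
    return index


def recognize(observed_matches, index):
    """Return the lexicon words whose precomputed signature equals the observed match set.

    `observed_matches` stands in for the output of a feature-level match function,
    assumed here (hypothetically) to report exactly the true bigram matches.
    """
    return index.get(frozenset(observed_matches), [])


# Toy data, chosen only for illustration.
lexicon = ["cat", "cot", "dog", "dot"]
short_refs = ["catalog", "hotdog", "cotton"]
long_refs = short_refs + ["doctor"]

short_index = build_index(lexicon, short_refs)
long_index = build_index(lexicon, long_refs)

# With three reference words, "cot" and "dot" have the same signature.
ambiguous = recognize({1, 2}, short_index)

# Adding a fourth reference word separates them.
resolved = recognize({1, 2}, long_index)
```

In this toy case `ambiguous` holds both "cot" and "dot", while `resolved` holds only "cot", because "doctor" shares the bigram "do" with "dot" but no bigram with "cot". This only shows how a longer reference list adds discriminating power; handling errors in feature extraction would need a tolerant comparison of signatures rather than the exact match used here.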
Keywords :
document image processing; feature extraction; image matching; linguistics; optical character recognition; English words; OCR; bigram co-occurrences; classification; feature representation; human-computer interaction; lexical context; lexical n-gram matches; linguistic property; reference list; segment matching; word discrimination; word recognition system; Degradation; Dictionaries; Entropy; Feature extraction; Matrix converters; Optical character recognition software; Probability; Statistics; Viterbi algorithm; Vocabulary
Conference_Titel :
Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on
Conference_Location :
Seattle, WA
Print_ISBN :
0-7695-1263-1
DOI :
10.1109/ICDAR.2001.953773