Title :
OCR with no shape training
Author :
Ho, Tin Kam ; Nagy, George
Author_Institution :
Lucent Technol. Bell Labs., Murray Hill, NJ, USA
Abstract :
We present a document-specific OCR system and apply it to a corpus of faxed business letters. Unsupervised classification of the segmented character bitmaps on each page, using a “clump” metric, typically yields several hundred clusters with highly skewed populations. Letter identities are assigned to each cluster by maximizing matches with a lexicon of English words. We found that for 2/3 of the pages, we can identify almost 80% of the words included in the lexicon, without any shape training. Residual errors are caused by mis-segmentation, including missed lines and punctuation. This research differs from earlier attempts to apply cipher decoding to OCR in: (1) using real data; (2) using a more appropriate clustering algorithm; and (3) decoding a many-to-many instead of a one-to-one mapping between clusters and letters.
Keywords :
document image processing; image classification; optical character recognition; business letters; document-specific OCR system; highly skewed populations; letter identities; many-to-many mapping; segmented character bitmaps; unsupervised classification; Business; Clustering algorithms; Decoding; Optical character recognition software; Prototypes; Robustness; Scattering; Shape; Tin; USA Councils;
Conference_Title :
Pattern Recognition, 2000. Proceedings. 15th International Conference on
Conference_Location :
Barcelona
Print_ISBN :
0-7695-0750-6
DOI :
10.1109/ICPR.2000.902858