Title :
Segmentation of Degraded Malayalam Words: Methods and Evaluation
Author :
Sachan, Devendra ; Dutta, Shrey ; Naveen, T.S. ; Jawahar, C.V.
Author_Institution :
Center for Visual Information Technology, IIIT Hyderabad, Hyderabad, India
Abstract :
In most Optical Character Recognition (OCR) software, a substantial percentage of errors is caused by incorrect segmentation of degraded words. This is especially true when recognizing old books, newspapers, and historical manuscripts. In this paper, we propose multiple segmentation methods that address the problem of cuts and merges in degraded words. We have created an annotated dataset of 1034 word images with pixel-level ground truth for quantitative evaluation of the methods. We compare the methods with a baseline implementation based on connected component analysis. We report substantial improvement in accuracy at both the character level and the word level.
Keywords :
image segmentation; natural language processing; optical character recognition; statistical analysis; baseline implementation; connected component analysis; degraded Malayalam words; historical manuscripts; multiple segmentation methods; newspapers; optical character recognition softwares; pixel level ground truth; Accuracy; Algorithm design and analysis; Databases; Degradation; Image segmentation; Optical character recognition software; Transforms; Character Segmentation; Degradation Correction; Indian Language; Malayalam;
Conference_Title :
Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2011 Third National Conference on
Conference_Location :
Hubli, Karnataka
Print_ISBN :
978-1-4577-2102-1
DOI :
10.1109/NCVPRIPG.2011.23