Title :
A two-stage style detection approach for printed Roman script words
Author :
Singh, Pawan Kumar ; Jana, Shantanu ; Sarkar, Ram ; Das, Nibaran ; Nasipuri, Mita
Author_Institution :
Dept. of Comput. Sci. & Eng., Jadavpur Univ., Kolkata, India
Abstract :
Optical Character Recognition (OCR) for printed Roman script remains an active area of research. Automatic Style Identification (ASI) can improve the performance of OCR systems and keyword-spotting techniques for printed Roman script. This paper proposes a two-stage, font-invariant technique for detecting italic, bold, underlined, normal, and all-capital styled words in printed Roman script. In the first stage, the technique separates underlined words from non-underlined words. In the second stage, a 25-element feature set is extracted from the non-underlined words to identify the remaining styles, and these features are evaluated using multiple classifiers. The technique has been tested on 2100 words printed in five well-known fonts, namely Arial, Cambria, Calibri, Gill Sans, and Times New Roman, with each font contributing 420 words. Based on the identification accuracies of the classifiers, the Multi Layer Perceptron (MLP) classifier was chosen as the final classifier and tested comprehensively using different numbers of folds and epochs. The overall accuracy of the system is 98.25% using a 3-fold cross-validation scheme.
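The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how underlines are detected or which 25 features are used, so the row-coverage heuristic in `has_underline`, its thresholds, the `classify_non_underlined` callback, and the `fold_sizes` helper are all assumptions introduced here for illustration.

```python
from typing import Callable, List, Sequence

# Binary word image: each row is a sequence of 0/1 pixels, 1 = ink.
Image = Sequence[Sequence[int]]


def has_underline(word: Image, min_coverage_percent: int = 90) -> bool:
    """Hypothetical stage-1 test (an assumption, not the paper's method).

    Reports an underline when some row in the bottom third of the image
    has ink across at least `min_coverage_percent` of the image width.
    """
    height = len(word)
    if height == 0 or len(word[0]) == 0:
        return False
    width = len(word[0])
    band_rows = max(1, height // 3)          # rows inspected at the bottom
    for row in word[height - band_rows:]:
        if sum(row) * 100 >= min_coverage_percent * width:
            return True
    return False


def detect_style(word: Image,
                 classify_non_underlined: Callable[[Image], str]) -> str:
    """Stage 1 filters underlined words; stage 2 delegates the rest.

    `classify_non_underlined` stands in for the second-stage classifier
    (an MLP over a 25-element feature vector in the paper); here it is
    any caller-supplied function returning a style label.
    """
    if has_underline(word):
        return "underlined"
    return classify_non_underlined(word)


def fold_sizes(n_samples: int, n_folds: int) -> List[int]:
    """Sizes of the test folds when n_samples are split into n_folds."""
    base, extra = divmod(n_samples, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]
```

With the figures quoted in the abstract, `fold_sizes(2100, 3)` yields three test folds of 700 words each, and 2100 words over five fonts gives 420 words per font.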
Keywords :
character sets; document image processing; multilayer perceptrons; optical character recognition; Arial; Calibri; Cambria; Gill Sans; MLP classifier; OCR; Times New Roman; all capital styled words; automatic style identification; bold styled words; font invariant technique; italic styled words; keyword spotting technique; multilayer perceptron classifier; multiple classifiers; optical character recognition; printed Roman script words; two-stage style detection approach; underlined words; Accuracy; Character recognition; Feature extraction; Image segmentation; Optical character recognition software; Shape; Text recognition; Automatic Style Identification; Font invariant; Multiple classifiers; Optical Character Recognition; Roman script;
Conference_Title :
Computer, Communication, Control and Information Technology (C3IT), 2015 Third International Conference on
Conference_Location :
Hooghly
Print_ISBN :
978-1-4799-4446-0
DOI :
10.1109/C3IT.2015.7060110