DocumentCode
2016748
Title
A two-stage style detection approach for printed Roman script words
Author
Singh, Pawan Kumar ; Jana, Shantanu ; Sarkar, Ram ; Das, Nibaran ; Nasipuri, Mita
Author_Institution
Dept. of Comput. Sci. & Eng., Jadavpur Univ., Kolkata, India
fYear
2015
fDate
7-8 Feb. 2015
Firstpage
1
Lastpage
6
Abstract
Development of Optical Character Recognition (OCR) for printed Roman script is still an active area of research. Automatic Style Identification (ASI) can be used to improve the performance of OCR systems and keyword spotting techniques for printed Roman script. This paper proposes a two-stage, font-invariant technique for detecting italic, bold, underlined, normal, and all-capital styled words in printed Roman script. In the first stage, the technique separates underlined words from non-underlined words. In the second stage, a 25-element feature set is extracted from the non-underlined words to identify the remaining styles, and these features are evaluated using multiple classifiers. The technique has been tested on 2100 words printed in five well-known fonts, namely Arial, Cambria, Calibri, Gill Sans, and Times New Roman, with each font contributing about 420 words. Based on the identification accuracies of the multiple classifiers, the Multi Layer Perceptron (MLP) classifier was chosen as the final classifier and was tested comprehensively using different numbers of folds and epochs. The overall accuracy of the system is 98.25% using a 3-fold cross-validation scheme.
Keywords
character sets; document image processing; multilayer perceptrons; optical character recognition; Arial; Calibri; Cambria; Gill Sans; MLP classifier; OCR; Times New Roman; all capital styled words; automatic style identification; bold styled words; font invariant technique; italic styled words; keyword spotting technique; multilayer perceptron classifier; multiple classifiers; optical character recognition; printed Roman script words; two-stage style detection approach; underlined words; Accuracy; Character recognition; Feature extraction; Image segmentation; Optical character recognition software; Shape; Text recognition; Automatic Style Identification; Font invariant; Multiple classifiers; Optical Character Recognition; Roman script;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer, Communication, Control and Information Technology (C3IT), 2015 Third International Conference on
Conference_Location
Hooghly
Print_ISBN
978-1-4799-4446-0
Type
conf
DOI
10.1109/C3IT.2015.7060110
Filename
7060110