• DocumentCode
    2016748
  • Title

    A two-stage style detection approach for printed Roman script words

  • Author

    Singh, Pawan Kumar ; Jana, Shantanu ; Sarkar, Ram ; Das, Nibaran ; Nasipuri, Mita

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Jadavpur Univ., Kolkata, India
  • fYear
    2015
  • fDate
    7-8 Feb. 2015
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    Development of Optical Character Recognition (OCR) for printed Roman script is still an active area of research. Automatic Style Identification (ASI) can be used to improve the performance of OCR systems and keyword spotting techniques for printed Roman script. This paper proposes a two-stage, font-invariant technique for detecting italic, bold, underlined, normal, and all-capital styled words in printed Roman script. In the first stage, the technique separates underlined words from non-underlined words. In the second stage, a 25-element feature set is extracted from the non-underlined words to identify the remaining styles, and these features are evaluated using multiple classifiers. The technique has been tested on 2100 words printed in five well-known fonts, namely Arial, Cambria, Calibri, Gill Sans, and Times New Roman, with each font contributing about 420 words. Based on the identification accuracies of the multiple classifiers, the Multi Layer Perceptron (MLP) classifier was chosen as the final classifier and was tested comprehensively using different numbers of folds and epochs. The overall accuracy of the system is 98.25% using a 3-fold cross-validation scheme.
  • Keywords
    character sets; document image processing; multilayer perceptrons; optical character recognition; Arial; Calibri; Cambria; Gill Sans; MLP classifier; OCR; Times New Roman; all capital styled words; automatic style identification; bold styled words; font invariant technique; italic styled words; keyword spotting technique; multilayer perceptron classifier; multiple classifiers; optical character recognition; printed Roman script words; two-stage style detection approach; underlined words; Accuracy; Character recognition; Feature extraction; Image segmentation; Optical character recognition software; Shape; Text recognition; Automatic Style Identification; Font invariant; Multiple classifiers; Optical Character Recognition; Roman script;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer, Communication, Control and Information Technology (C3IT), 2015 Third International Conference on
  • Conference_Location
    Hooghly
  • Print_ISBN
    978-1-4799-4446-0
  • Type

    conf

  • DOI
    10.1109/C3IT.2015.7060110
  • Filename
    7060110