  • DocumentCode
    2412021
  • Title
    Word Level Script Identification in Bilingual Documents through Discriminating Features
  • Author
    Dhandra, B.V. ; Hangarge, Mallikarjun ; Hegadi, Ravindra ; Malemath, V.S.
  • Author_Institution
    P. G. Dept. of Studies & Res. in Comput., Gulbarga Univ.
  • fYear
    2007
  • fDate
    22-24 Feb. 2007
  • Firstpage
    630
  • Lastpage
    635
  • Abstract
    India is a multi-lingual and multi-script country where a line of a bilingual document page may contain text words in a regional language and numerals in English. For optical character recognition (OCR) of such a document page, it is necessary to identify the different script forms before running the individual OCR for each script. In this paper, we examine the use of discriminating features (aspect ratio, strokes, eccentricity, etc.) as a tool for determining the script at word level in three bilingual documents representing Kannada, Tamil and Devnagari containing English numerals, based on the observation that every text has a distinct visual appearance. The k-nearest neighbour algorithm is used to classify new word images. The proposed algorithm is tested on 2500 sample words with various font styles and sizes. The results obtained are quite encouraging.
  • Keywords
    document image processing; natural languages; optical character recognition; word processing; Devnagari; English; Kannada; OCR; Tamil; bilingual document; k-nearest neighbour algorithm; word level script identification; Books; Character recognition; Computer science; Gabor filters; Natural languages; Optical character recognition software; Pattern recognition; Postal services; Sorting; Testing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Signal Processing, Communications and Networking, 2007. ICSCN '07. International Conference on
  • Conference_Location
    Chennai
  • Print_ISBN
    1-4244-0997-7
  • Electronic_ISBN
    1-4244-0997-7
  • Type
    conf
  • DOI
    10.1109/ICSCN.2007.350686
  • Filename
    4156701
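
The approach summarised in the abstract (shape features per word image, then k-nearest-neighbour classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact feature definitions, the distance measure, or the value of k, so the moment-based eccentricity, the Euclidean distance, k = 3, the function names `word_features` and `knn_classify`, and the labels "text" and "numeral" are all assumptions made for illustration. Stroke features are omitted.

```python
import math
from collections import Counter


def word_features(img):
    """Return (aspect_ratio, eccentricity) for a binary word image.

    img is a 2D list of 0/1 values where 1 marks an ink pixel.
    Both feature definitions are illustrative assumptions.
    """
    pts = [(r, c) for r, row in enumerate(img) for c, v in enumerate(row) if v]
    if not pts:
        raise ValueError("word image contains no ink pixels")
    rows = [r for r, _ in pts]
    cols = [c for _, c in pts]

    # Aspect ratio of the ink bounding box (width / height).
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    aspect = width / height

    # Second-order central moments of the ink pixels.
    n = len(pts)
    mr, mc = sum(rows) / n, sum(cols) / n
    mu_rr = sum((r - mr) ** 2 for r in rows) / n
    mu_cc = sum((c - mc) ** 2 for c in cols) / n
    mu_rc = sum((r - mr) * (c - mc) for r, c in pts) / n

    # Eigenvalues of the 2x2 covariance matrix give the principal spreads.
    spread = math.sqrt((mu_rr - mu_cc) ** 2 + 4 * mu_rc ** 2)
    lam_major = (mu_rr + mu_cc + spread) / 2
    lam_minor = (mu_rr + mu_cc - spread) / 2
    if lam_major > 0:
        ecc = math.sqrt(max(0.0, 1 - lam_minor / lam_major))
    else:
        ecc = 0.0
    return (aspect, ecc)


def knn_classify(train, query, k=3):
    """Majority vote among the k training samples nearest to query.

    train is a list of (feature_tuple, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: feature values and labels are made up.
train = [
    ((5.0, 0.98), "text"),
    ((4.5, 0.97), "text"),
    ((1.0, 0.10), "numeral"),
    ((1.2, 0.20), "numeral"),
    ((0.9, 0.00), "numeral"),
]

wide_word = [[1] * 10 for _ in range(2)]   # long, flat blob
square_blob = [[1] * 4 for _ in range(4)]  # compact blob

label_wide = knn_classify(train, word_features(wide_word), k=3)
label_square = knn_classify(train, word_features(square_blob), k=3)
```

In practice the features would be scaled to comparable ranges before computing distances, since an unscaled aspect ratio can dominate a feature bounded in [0, 1].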