• DocumentCode
    2765343
  • Title

    Text Line Identification from a Multilingual Document

  • Author

    Vijaya, P.A. ; Padma, M.C.

  • Author_Institution
    Dept. of E & C Eng., Malnad Coll. of Eng., Hassan, India
  • fYear
    2009
  • fDate
    7-9 March 2009
  • Firstpage
    302
  • Lastpage
    305
  • Abstract
    In India, a single document may contain text lines in more than one language. For optical character recognition (OCR) of such a multilingual document, the language of each part of the input document must be identified before the text is passed to the OCR system of the corresponding language. In this paper, a simple but efficient technique of language identification for Kannada, Hindi and English text lines from a printed document is presented. The proposed system is based on characteristic features of the top profile and bottom profile of the individual text lines of the input document image. The system is trained to learn the behavior of the top and bottom profiles with a training data set of 800 text lines. The ranges of feature values of the top and bottom profiles for all three languages are obtained and stored in a knowledge base for later use during decision-making. For a new text line, the necessary features are extracted from the top and bottom profiles and the feature values obtained are compared with the stored knowledge base. The new text line is assigned to the language whose stored range contains its feature values. The proposed system is tested on 600 text lines and an overall classification accuracy of 96.6% is achieved.
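The range-based decision scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not say which features are computed from the profiles, so the two "flatness" features below are hypothetical stand-ins, as are all function names (`top_bottom_profiles`, `extract_features`, `learn_ranges`, `classify`). The sketch assumes a binarized text-line image in which nonzero pixels are ink; it is not the authors' implementation.

```python
import numpy as np

def top_bottom_profiles(line_img):
    """Return (top, bottom) ink row indices for every column that contains ink.

    Assumes line_img is a 2D array of one text line, nonzero = ink.
    """
    ink = np.asarray(line_img) > 0
    has_ink = ink.any(axis=0)
    top = ink.argmax(axis=0)[has_ink]                    # first ink row from the top
    bottom = (ink.shape[0] - 1 - ink[::-1].argmax(axis=0))[has_ink]  # last ink row
    return top, bottom

def extract_features(line_img):
    """Hypothetical features: the share of columns whose top (or bottom)
    profile sits on the most frequent row, i.e. how flat each profile is."""
    top, bottom = top_bottom_profiles(line_img)
    if top.size == 0:
        return (0.0, 0.0)
    top_flat = np.bincount(top).max() / top.size
    bottom_flat = np.bincount(bottom).max() / bottom.size
    return (float(top_flat), float(bottom_flat))

def learn_ranges(samples):
    """Build the knowledge base: per-label (min, max) of each feature.

    samples maps a language label to a list of feature tuples.
    """
    kb = {}
    for label, feats in samples.items():
        arr = np.array(feats, dtype=float)
        kb[label] = (arr.min(axis=0), arr.max(axis=0))
    return kb

def classify(features, kb):
    """Return the first label whose stored range contains all feature values,
    or None when no range matches."""
    f = np.array(features, dtype=float)
    for label, (lo, hi) in kb.items():
        if np.all(f >= lo) and np.all(f <= hi):
            return label
    return None
```

In use, `learn_ranges` would be fed the features of labeled training lines, and `classify(extract_features(img), kb)` would then be called on each new line. If the ranges of two languages overlap, this sketch simply returns the first match; how the paper resolves such cases is not stated in the abstract.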
  • Keywords
    document image processing; feature extraction; learning (artificial intelligence); optical character recognition; text analysis; OCR; decision-making; feature extraction; knowledge base; language identification; learning; multilingual document image; optical character recognition; text line identification; Character recognition; Data mining; Decision making; Digital images; Educational institutions; Feature extraction; Natural languages; Optical character recognition software; System testing; Training data; Bottom Profile; Document Image Processing; Feature extraction; Language Identification; Multi-lingual document; Top Profile;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Digital Image Processing, 2009 International Conference on
  • Conference_Location
    Bangkok
  • Print_ISBN
    978-0-7695-3565-4
  • Type

    conf

  • DOI
    10.1109/ICDIP.2009.51
  • Filename
    5190583