• DocumentCode
    2000759
  • Title

    Heuristic based script identification from multilingual text documents

  • Author

    Das, M. Swamy ; Rani, D. Sandhya ; Reddy, C.R.K.

  • Author_Institution
    Dept. of Comput. Sci. & Eng., CBIT, Hyderabad, India
  • fYear
    2012
  • fDate
    15-17 March 2012
  • Firstpage
    487
  • Lastpage
    492
  • Abstract
    A multilingual document may contain text in more than one language. In a multilingual country like India, documents often need to carry text in several languages in order to reach a larger cross section of people. This, however, makes OCR (Optical Character Recognition) of such documents difficult in practice, because the language of the text must be determined before a particular OCR system can be applied. It is perhaps impossible to design a single recognizer that can handle a large number of scripts/languages, so the language region of the document must be identified before the document is fed to the corresponding OCR system. Script identification aims to extract information presented in digital documents such as articles, newspapers, magazines and e-books. This has given rise to many language identification systems. The objective of this paper is to propose a model that identifies the script type of different text portions using visual clues. Seven features, namely bottom max row, top horizontal lines, vertical lines, bottom components, tick components, top holes and bottom holes, are used to identify the script type. Multilingual documents with Telugu, English and Hindi scripts are used in this work. The experiments show an identification accuracy of above 93%.
  • Keywords
    optical character recognition; text analysis; English script; Hindi script; India; OCR system; Telugu script; articles; digital documents; e-books; heuristic based script identification; language identification systems; magazines; multilingual country; multilingual documents; multilingual text documents; newspapers; script type identification; text contents; text words; visual clues; Feature extraction; Gabor filters; Information technology; Internet; Optical character recognition software; Shape; Visualization; OCR; Visual features; pipe density; profiles; tick components
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Recent Advances in Information Technology (RAIT), 2012 1st International Conference on
  • Conference_Location
    Dhanbad
  • Print_ISBN
    978-1-4577-0694-3
  • Type

    conf

  • DOI
    10.1109/RAIT.2012.6194627
  • Filename
    6194627