Author :
Sarungbam, Jeelen Kumar ; Kumar, Bijendra ; Choudhary, Alok
Author_Institution :
Dept. of Comput. Sci. & Eng., Amity Univ. Noida, Noida, India
Abstract :
India is a diverse country with many cultural and traditional differences. The country has more than 12 distinct languages, viz., Hindi, Bangla, Marathi, Oriya, Tamil, Telugu, Assamese, Manipuri, Gujarati, Kannada, Malayalam, Panjabi, Nepali, Tibetan, Urdu, etc. An Optical Character Recognition (OCR) system for Indian languages needs to be designed so that it automatically identifies the language of the input document before further processing. Many techniques have already been implemented, but the difficulty lies in identifying the correct language, because some languages use the same or a similar script. For example, the Bangla script is used to write the Bengali, Assamese and Manipuri languages. Although scripts can be distinguished using global techniques, the problem of languages that share a similar script remains. To deal with this problem, a robust wavelet transform cum template-matching technique, invariant to rotation, scale and translation, is deployed to identify the script and detect the language of the document automatically.
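As a rough illustration of the general idea named in the abstract (wavelet decomposition followed by template matching), the sketch below is a minimal example and not the authors' implementation. It assumes a single-level Haar wavelet, keeps only the approximation (LL) band, and scores candidate templates with normalized cross-correlation. The function names (haar_approx, ncc, best_match, detect_language) and the labels "lang_a" and "lang_b" are hypothetical. The sketch handles translation only; the rotation and scale invariance mentioned in the abstract is not implemented here.

```python
from math import sqrt


def haar_approx(img):
    """One level of a 2-D Haar DWT, keeping only the approximation (LL) band."""
    rows = len(img) // 2 * 2
    cols = len(img[0]) // 2 * 2
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 2.0
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]


def ncc(a, b):
    """Normalized cross-correlation of two equal-sized patches (range -1 to 1)."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    ma = sum(fa) / len(fa)
    mb = sum(fb) / len(fb)
    num = sum((x - ma) * (y - mb) for x, y in zip(fa, fb))
    den = sqrt(sum((x - ma) ** 2 for x in fa) * sum((y - mb) ** 2 for y in fb))
    return num / den if den else 0.0  # flat patches carry no shape information


def best_match(image, template):
    """Slide the template over the image in the LL band.

    Returns (score, row, col), with row and col in LL-band coordinates.
    """
    img, tpl = haar_approx(image), haar_approx(template)
    th, tw = len(tpl), len(tpl[0])
    best = (-1.0, 0, 0)
    for r in range(len(img) - th + 1):
        for c in range(len(img[0]) - tw + 1):
            patch = [row[c:c + tw] for row in img[r:r + th]]
            score = ncc(patch, tpl)
            if score > best[0]:
                best = (score, r, c)
    return best


def detect_language(image, templates):
    """Return the label whose template matches best, plus all scores."""
    scores = {label: best_match(image, tpl)[0] for label, tpl in templates.items()}
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): a vertical-bar glyph and a horizontal-bar glyph.
tpl_a = [[1 if c in (2, 3) else 0 for c in range(8)] for r in range(8)]
tpl_b = [[1 if r in (2, 3) else 0 for c in range(8)] for r in range(8)]

# A 16x16 "document" containing the vertical-bar glyph at row 4, column 6.
image = [[0] * 16 for _ in range(16)]
for r in range(8):
    for c in range(8):
        image[4 + r][6 + c] = tpl_a[r][c]

label, scores = detect_language(image, {"lang_a": tpl_a, "lang_b": tpl_b})
print(label, scores)
```

In a setting like the one the abstract describes, each template would presumably be a frequently occurring character of one language, so that languages sharing a script can still be told apart by characters typical of each. That mapping is an assumption here, drawn only from the keywords of the record.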
Keywords :
document image processing; image matching; linguistics; optical character recognition; wavelet transforms; Assamese languages; Bangla; Bengali languages; DWT; Gujarati; Hindi; Indian languages; Kannada; Malayalam; Manipuri languages; Marathi; Nepali; OCR; Oriya; Panjabi; Tamil; Telugu; Tibetan; Urdu; correct language detection; correct language identification; frequently occurring character; script identification; template matching; wavelet transform cum template-matching invariant; Discrete wavelet transforms; Information technology; Next generation networking; Optical character recognition software; Pragmatics; Document Image Processing; Frequently occurring characters in Indian language; Language Detection; Script Identification; Template Matching; Wavelet Transformation