DocumentCode :
146518
Title :
Script identification and language detection of 12 Indian languages using DWT and template matching of Frequently Occurring Character(s)
Author :
Sarungbam, Jeelen Kumar ; Kumar, Bijendra ; Choudhary, Alok
Author_Institution :
Dept. of Comput. Sci. & Eng., Amity Univ. Noida, Noida, India
fYear :
2014
fDate :
25-26 Sept. 2014
Firstpage :
669
Lastpage :
674
Abstract :
India is a diverse country with many cultural and traditional differences. More than 12 distinct languages are spoken in the country, viz., Hindi, Bangla, Marathi, Oriya, Tamil, Telugu, Assamese, Manipuri, Gujarati, Kannada, Malayalam, Panjabi, Nepali, Tibetan, Urdu, etc. An Optical Character Recognition (OCR) system for Indian languages needs to be designed so that it automatically identifies the language of the input document for further processing. Many techniques have already been implemented, but the difficulty lies in identifying the correct language, because some languages use the same or a similar script. For example, the Bangla script is used to write the Bengali, Assamese and Manipuri languages. Although scripts can be distinguished using global techniques, the problem of languages that share a similar script still exists. To deal with this problem, a robust wavelet-transform-cum-template-matching technique, invariant to rotation, scale and translation, is deployed to identify the script and detect the language of the document automatically.
Keywords :
document image processing; image matching; linguistics; optical character recognition; wavelet transforms; Assamese languages; Bangla; Bengali languages; DWT; Gujarati; Hindi; Indian Languages; Indian languages; Kannada; Malayalam; Manipuri languages; Marathi; Nepali; OCR; Oriya; Panjabi; Tamil; Telugu; Tibetan; Urdu; correct language detection; correct language identification; frequently occurring character; optical character recognition; script identification; template matching; wavelet transform cum template-matching invariant; Discrete wavelet transforms; Information technology; Next generation networking; Optical character recognition software; Pragmatics; Document Image Processing; Frequently occurring characters in Indian language; Language Detection; Script Identification; Template Matching; Wavelet Transformation;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Confluence The Next Generation Information Technology Summit (Confluence), 2014 5th International Conference -
Conference_Location :
Noida
Print_ISBN :
978-1-4799-4237-4
Type :
conf
DOI :
10.1109/CONFLUENCE.2014.6949300
Filename :
6949300