Regional Information Center for Science and Technology - Script identification and language detection of 12 Indian languages using DWT and template matching of Frequently Occurring Character(s)

DocumentCode :

146518

Title :

Script identification and language detection of 12 Indian languages using DWT and template matching of Frequently Occurring Character(s)

Author :

Sarungbam, Jeelen Kumar ; Kumar, Bijendra ; Choudhary, Alok

Author_Institution :

Dept. of Comput. Sci. & Eng., Amity Univ. Noida, Noida, India

fYear :

2014

fDate :

25-26 Sept. 2014

Firstpage :

669

Lastpage :

674

Abstract :

India is a culturally diverse country with more than 12 distinct languages, including Hindi, Bangla, Marathi, Oriya, Tamil, Telugu, Assamese, Manipuri, Gujarati, Kannada, Malayalam, Panjabi, Nepali, Tibetan and Urdu. Optical Character Recognition (OCR) for Indian languages needs to identify the language of the input document automatically before further processing. Many techniques have already been implemented, but correctly identifying the language remains difficult because some languages use the same or similar scripts. For example, the Bangla script is used to write Bengali, Assamese and Manipuri. Although scripts can be distinguished using global techniques, the problem of languages sharing a similar script still exists. To deal with this problem, a robust wavelet transform cum template-matching technique, invariant to rotation, scale and translation, is deployed to identify the script and detect the language of the document automatically.
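The two building blocks named in the abstract, a discrete wavelet transform and template matching, can be illustrated with a minimal sketch. This is not the authors' implementation: the Haar wavelet, normalized cross-correlation as the matching score, and the function names `haar_dwt2_approx`, `ncc` and `best_match` are all assumptions made for illustration, and the sketch does not attempt the rotation or scale invariance the abstract describes.

```python
import math


def haar_dwt2_approx(img):
    """One-level 2-D Haar approximation (LL) band of a grayscale image.

    Each output value is the sum of a 2x2 block divided by 2, which is the
    orthonormal Haar low-pass filter applied along rows and then columns.
    """
    rows = len(img) // 2 * 2
    cols = len(img[0]) // 2 * 2
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 2.0
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]


def ncc(patch, template):
    """Normalized cross-correlation between two equally sized 2-D lists."""
    p = [v for row in patch for v in row]
    t = [v for row in template for v in row]
    mean_p = sum(p) / len(p)
    mean_t = sum(t) / len(t)
    num = sum((a - mean_p) * (b - mean_t) for a, b in zip(p, t))
    den = math.sqrt(sum((a - mean_p) ** 2 for a in p) *
                    sum((b - mean_t) ** 2 for b in t))
    return num / den if den else 0.0  # flat patches carry no shape information


def best_match(image, template):
    """Slide the template over the image; return (best score, (row, col))."""
    th, tw = len(template), len(template[0])
    best = (-2.0, (0, 0))
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = ncc(patch, template)
            if score > best[0]:
                best = (score, (r, c))
    return best
```

In a pipeline of this kind, one would presumably store a template for a frequently occurring character of each candidate language, run `best_match` on the wavelet approximation of the document image, and pick the language whose template scores highest. That selection step is likewise an assumption, not a detail stated in the record.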

Keywords :

document image processing; image matching; linguistics; optical character recognition; wavelet transforms; Assamese languages; Bangla; Bengali languages; DWT; Gujarati; Hindi; Indian Languages; Indian languages; Kannada; Malayalam; Manipuri languages; Marathi; Nepali; OCR; Oriya; Panjabi; Tamil; Telugu; Tibetan; Urdu; correct language detection; correct language identification; frequently occurring character; optical character recognition; script identification; template matching; wavelet transform cum template-matching invariant; Discrete wavelet transforms; Information technology; Next generation networking; Optical character recognition software; Pragmatics; Document Image Processing; Frequently occurring characters in Indian language; Language Detection; Script Identification; Template Matching; Wavelet Transformation

fLanguage :

English

Publisher :

IEEE

Conference_Title :

2014 5th International Conference - Confluence The Next Generation Information Technology Summit (Confluence)

Conference_Location :

Noida

Print_ISBN :

978-1-4799-4237-4

Type :

conf

DOI :

10.1109/CONFLUENCE.2014.6949300

Filename :

6949300

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=146518