DocumentCode :
2880247
Title :
Language identification based on string kernels
Author :
Kruengkrai, Canasai ; Srichaivattana, Prapass ; Sornlertlamvanich, Virach ; Isahara, Hitoshi
Author_Institution :
Thai Comput. Linguistics Lab., National Inst. of Inf. & Commun. Technol., Thailand
Volume :
2
fYear :
2005
fDate :
12-14 Oct. 2005
Firstpage :
926
Lastpage :
929
Abstract :
In this paper, we propose a novel approach to automatically identifying the language of a given text, based on the concept of string kernels. Our approach identifies the language directly from the text, regardless of its coding system. In particular, we view the text at a finer-grained level, as a string of bytes. The similarity between two strings can be computed implicitly through an efficient dynamic alignment using suffix trees. We provide empirical evidence that applying string kernels to the language identification problem yields impressive performance with two different kernel classifiers: a kernelized version of the centroid-based method and support vector machines. Our experiments use a data set of reasonable scale in terms of the number of languages to be identified, covering 17 different languages.
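The idea in the abstract (treat text as raw bytes, compare strings with a kernel, classify by the nearest class centroid in feature space) can be illustrated with a small sketch. This is not the authors' implementation: it swaps in a simple byte-level p-spectrum kernel (shared byte n-gram counts) in place of the suffix-tree-based string kernel, and the function names, the value `p=3`, and the toy training sentences are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def byte_ngrams(text: bytes, p: int) -> Counter:
    """Count every contiguous run of p bytes in the text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def spectrum_kernel(a: bytes, b: bytes, p: int = 3) -> float:
    """p-spectrum kernel: dot product of the two byte n-gram count vectors.

    Assumption: a simplified stand-in for the string kernel in the abstract.
    """
    ca, cb = byte_ngrams(a, p), byte_ngrams(b, p)
    if len(ca) > len(cb):
        ca, cb = cb, ca  # iterate over the smaller table
    return float(sum(n * cb[g] for g, n in ca.items()))


def normalized_kernel(a: bytes, b: bytes, p: int = 3) -> float:
    """Cosine-normalized kernel, so that text length does not dominate."""
    kab = spectrum_kernel(a, b, p)
    if kab == 0.0:
        return 0.0
    return kab / sqrt(spectrum_kernel(a, a, p) * spectrum_kernel(b, b, p))


def centroid_distance_sq(x: bytes, examples: list, p: int = 3) -> float:
    """Squared feature-space distance from x to the centroid of `examples`.

    ||phi(x) - mu||^2 = k(x,x) - (2/n) sum_i k(x,x_i) + (1/n^2) sum_ij k(x_i,x_j)
    """
    n = len(examples)
    cross = sum(normalized_kernel(x, e, p) for e in examples) / n
    within = sum(normalized_kernel(e, f, p)
                 for e in examples for f in examples) / (n * n)
    return normalized_kernel(x, x, p) - 2.0 * cross + within


def identify(x: bytes, training: dict, p: int = 3) -> str:
    """Return the label whose centroid is closest to x."""
    return min(training, key=lambda lang: centroid_distance_sq(x, training[lang], p))


# Hypothetical toy data; the bytes could come from any encoding.
training = {
    "en": [b"the quick brown fox jumps over the lazy dog",
           b"this is a short english sentence"],
    "de": ["der schnelle braune fuchs springt über den faulen hund".encode("utf-8"),
           "dies ist ein kurzer deutscher satz".encode("utf-8")],
}

if __name__ == "__main__":
    print(identify(b"the dog is over there", training))
    print(identify("der hund ist schnell".encode("utf-8"), training))
```

Because the kernel only ever sees bytes, no decoding step or knowledge of the character set is needed. The same kernel values could also be passed to a support vector machine as a precomputed Gram matrix, which is the second classifier the abstract mentions.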
Keywords :
natural languages; speech coding; speech recognition; support vector machines; centroid-based method; coding system; dynamic alignment; empirical evidence; fine-grained encoding; language identification problem; string kernels; suffix trees; Communications technology; Computational linguistics; Dictionaries; Encoding; Kernel; Laboratories; Search engines; Support vector machine classification; Web pages
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Communications and Information Technology, 2005. ISCIT 2005. IEEE International Symposium on
Print_ISBN :
0-7803-9538-7
Type :
conf
DOI :
10.1109/ISCIT.2005.1567018
Filename :
1567018