DocumentCode
2880247
Title
Language identification based on string kernels
Author
Kruengkrai, Canasai ; Srichaivattana, Prapass ; Sornlertlamvanich, Virach ; Isahara, Hitoshi
Author_Institution
Thai Comput. Linguistics Lab., National Inst. of Inf. & Commun. Technol., Thailand
Volume
2
fYear
2005
fDate
12-14 Oct. 2005
Firstpage
926
Lastpage
929
Abstract
In this paper, we propose a novel approach to automatically identifying the language of a given text, based on the concept of string kernels. Our approach identifies the language directly from the text, regardless of its coding system. In particular, we view the text at a finer granularity than characters, as a string of bytes. The similarity between two strings is computed implicitly through an efficient dynamic alignment using suffix trees. We provide empirical evidence that string kernels perform strongly on the language identification problem with two different kernel classifiers: a kernelized version of the centroid-based method and support vector machines. Our experiments use a data set of reasonable scale in terms of the number of languages to be identified, covering 17 different languages.
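The idea in the abstract can be illustrated with a minimal sketch, under several assumptions that are not taken from the paper. The kernel below is a plain byte n-gram (spectrum-style) kernel computed by explicit counting. It stands in for the suffix-tree computation the abstract mentions and is not the authors' implementation. The function names (`byte_ngrams`, `kernel`, `normalized_kernel`, `centroid_distance_sq`, `identify`), the `max_n` parameter, and the toy sentences are all hypothetical. The classifier is a kernelized centroid rule: the squared distance from a text to a class centroid in feature space is written purely in terms of kernel values.

```python
from collections import Counter
from math import sqrt


def byte_ngrams(text: str, max_n: int = 3, encoding: str = "utf-8") -> Counter:
    """Count all byte substrings of length 1..max_n (assumed feature map)."""
    data = text.encode(encoding)  # work on raw bytes, not characters
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts


def kernel(a: Counter, b: Counter) -> float:
    """Inner product of two n-gram count vectors (naive, not suffix-tree based)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return float(sum(v * b[k] for k, v in a.items() if k in b))


def normalized_kernel(a: Counter, b: Counter) -> float:
    """Cosine-normalized kernel, so text length does not dominate."""
    denom = sqrt(kernel(a, a) * kernel(b, b))
    return kernel(a, b) / denom if denom else 0.0


def centroid_distance_sq(x: Counter, members: list) -> float:
    """Squared feature-space distance from x to the centroid of `members`:
    k(x,x) - (2/n) * sum_i k(x,m_i) + (1/n^2) * sum_ij k(m_i,m_j)."""
    n = len(members)
    cross = sum(normalized_kernel(x, m) for m in members) / n
    within = sum(normalized_kernel(p, q) for p in members for q in members) / (n * n)
    return normalized_kernel(x, x) - 2.0 * cross + within


def identify(text: str, training: dict) -> str:
    """Return the label whose class centroid is nearest to `text`."""
    x = byte_ngrams(text)
    return min(training, key=lambda lang: centroid_distance_sq(x, training[lang]))


# Toy usage with hypothetical training sentences (not the paper's data set).
samples = {
    "en": ["the cat sat on the mat", "this is a simple sentence"],
    "fr": ["le chat est sur le tapis", "ceci est une phrase simple"],
}
training = {lang: [byte_ngrams(s) for s in texts] for lang, texts in samples.items()}
print(identify("the cat is on the mat", training))
```

Because the features are bytes, the same code runs unchanged on text in any encoding passed through the `encoding` argument. A practical system would replace the explicit counting with a suffix-tree or suffix-array computation and could feed the same kernel values to a support vector machine.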
Keywords
natural languages; speech coding; speech recognition; support vector machines; centroid-based method; coding system; dynamic alignment; empirical evidence; fine-grained encoding; language identification problem; string kernels; suffix trees; Communications technology; Computational linguistics; Dictionaries; Encoding; Kernel; Laboratories; Search engines; Support vector machine classification; Support vector machines; Web pages
fLanguage
English
Publisher
IEEE
Conference_Titel
Communications and Information Technology, 2005. ISCIT 2005. IEEE International Symposium on
Print_ISBN
0-7803-9538-7
Type
conf
DOI
10.1109/ISCIT.2005.1567018
Filename
1567018