DocumentCode
1659097
Title
An efficient method of language identification using LVQ network
Author
Xiao, Han ; Yu, Lei ; Chen, Kai
Author_Institution
Sch. of Inf. Eng., Beijing Univ. of Posts & Telecommun., Beijing
fYear
2008
Firstpage
1690
Lastpage
1694
Abstract
This paper presents a new method for language identification based on an LVQ (learning vector quantization) network. The feature vector combines the presence of particular characters and words with statistical information about word lengths. The new classification technique is faster than the conventional N-gram-based classification approach while achieving a similar correct classification rate. In an identification experiment with 8 Roman-alphabet languages, the LVQ network achieved a 97.6% correct classification rate with 500 bytes of text and ran five times faster than the N-gram-based approach.
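The approach summarized in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the abstract does not name the characters, words, or word-length statistics used, so `MARKER_CHARS`, `MARKER_WORDS`, and the scaled mean word length are hypothetical stand-ins, the update rule is the basic LVQ1 variant, and the three-language training set is toy data.

```python
import math

# Hypothetical feature choices; the abstract does not list the real ones.
MARKER_CHARS = ["ñ", "ß", "ç"]
MARKER_WORDS = ["the", "der", "el"]


def extract_features(text):
    """Presence flags for marker characters and words, plus scaled mean word length."""
    lowered = text.lower()
    words = lowered.split()
    char_flags = [1.0 if c in lowered else 0.0 for c in MARKER_CHARS]
    word_flags = [1.0 if w in words else 0.0 for w in MARKER_WORDS]
    mean_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return char_flags + word_flags + [mean_len / 10.0]


def nearest(prototypes, x):
    """Index of the prototype closest to x in Euclidean distance."""
    return min(range(len(prototypes)), key=lambda i: math.dist(prototypes[i][0], x))


def init_prototypes(samples):
    """One prototype per class, seeded from the first sample of that class."""
    seeds = {}
    for x, label in samples:
        seeds.setdefault(label, list(x))
    return [(vec, label) for label, vec in seeds.items()]


def train_lvq1(samples, prototypes, lr=0.3, epochs=20):
    """LVQ1: pull the winning prototype toward same-class samples, push it away otherwise."""
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)  # linearly decaying learning rate
        for x, label in samples:
            i = nearest(prototypes, x)
            vec, proto_label = prototypes[i]
            sign = 1.0 if proto_label == label else -1.0
            moved = [v + sign * rate * (xi - v) for v, xi in zip(vec, x)]
            prototypes[i] = (moved, proto_label)
    return prototypes


def classify(prototypes, text):
    """Label of the prototype nearest to the text's feature vector."""
    return prototypes[nearest(prototypes, extract_features(text))][1]


# Toy training data, invented for this sketch.
TRAIN = [
    ("the cat sat on the mat", "en"),
    ("the weather is nice today", "en"),
    ("der Hund läuft auf der Straße", "de"),
    ("der Fluß ist groß", "de"),
    ("el niño come en el jardín", "es"),
    ("el señor vive en España", "es"),
]

SAMPLES = [(extract_features(text), label) for text, label in TRAIN]
PROTOTYPES = train_lvq1(SAMPLES, init_prototypes(SAMPLES))
```

Classification in this sketch costs one distance computation per prototype over a short fixed-length vector, which suggests why such a classifier might run faster than comparing N-gram profiles; the sketch does not reproduce the reported speed or accuracy figures.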
Keywords
classification; feature extraction; learning (artificial intelligence); natural languages; text analysis; vector quantisation; Roman alphabet languages; feature extraction; feature vector; language identification; learning vector quantization; word lengths; Books; Data mining; Feature extraction; Frequency; Natural languages; Organizing; Statistical distributions; Statistics; Vector quantization; Web and internet services;
fLanguage
English
Publisher
IEEE
Conference_Titel
Signal Processing, 2008. ICSP 2008. 9th International Conference on
Conference_Location
Beijing
Print_ISBN
978-1-4244-2178-7
Electronic_ISBN
978-1-4244-2179-4
Type
conf
DOI
10.1109/ICOSP.2008.4697462
Filename
4697462