Title :
VQ-based written language identification
Author :
Pham, Tuan ; Tran, Dat
Author_Institution :
Sch. of Comput. & Inf. Technol., Griffith Univ., Brisbane, Qld., Australia
Abstract :
Humans can recognize different written languages by their grammars and vocabularies, whereas computers see everything as numbers. We present a computational algorithm for machine classification of written languages using vector quantization. For a language document, each word is converted to a sequence of numbers according to its characters and forms a vector of numerical values. This collection of vectors is then represented by a codebook that contains a number of template vectors for classification. The proposed method is more effective for machine learning than the n-gram-based method, which has been widely used for written language identification. Experimental results on classifying a set of five closely related Roman-script languages indicate that the proposed method is promising.
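The pipeline described in the abstract (words to numeric vectors, one codebook per language, classification by codebook match) can be illustrated with a minimal sketch. The abstract does not give the character encoding, the vector dimension, the codebook training algorithm, or the decision rule, so everything below is an assumption made for illustration: letters a..z are mapped to 1..26 with zero padding, codebooks are trained with plain Lloyd (k-means) iterations, and a document is assigned to the language whose codebook gives the lowest average squared-error distortion. All function names are hypothetical.

```python
import re


def word_to_vector(word, dim=8):
    """Assumed encoding: a..z -> 1..26, zero-padded or truncated to `dim`."""
    codes = [float(ord(c) - ord("a") + 1) for c in word.lower() if "a" <= c <= "z"]
    return (codes + [0.0] * dim)[:dim]


def text_to_vectors(text, dim=8):
    """Split text into ASCII-letter words and encode each one."""
    return [word_to_vector(w, dim) for w in re.findall(r"[A-Za-z]+", text)]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train_codebook(vectors, size=4, iterations=10):
    """Build a codebook with Lloyd (k-means) iterations; assumed, not from the paper."""
    if not vectors:
        raise ValueError("need at least one training vector")
    unique = sorted(set(map(tuple, vectors)))
    size = min(size, len(unique))
    # Deterministic initialization: evenly spaced picks from the sorted unique vectors.
    step = len(unique) / size
    codebook = [list(unique[int(i * step)]) for i in range(size)]
    for _ in range(iterations):
        cells = [[] for _ in codebook]
        for v in vectors:
            nearest = min(range(size), key=lambda j: sq_dist(v, codebook[j]))
            cells[nearest].append(v)
        for j, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous codeword
                codebook[j] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook


def average_distortion(vectors, codebook):
    """Mean squared distance from each vector to its nearest codeword."""
    if not vectors:
        raise ValueError("need at least one vector")
    return sum(min(sq_dist(v, c) for c in codebook) for v in vectors) / len(vectors)


def identify(text, codebooks, dim=8):
    """Return the language whose codebook yields the lowest average distortion."""
    vectors = text_to_vectors(text, dim)
    return min(codebooks, key=lambda lang: average_distortion(vectors, codebooks[lang]))
```

In use, one codebook would be trained per language from sample text, for example `codebooks = {lang: train_codebook(text_to_vectors(sample)) for lang, sample in samples.items()}`, and `identify(document, codebooks)` would then return the best-matching language label. The codebook size and vector dimension here are arbitrary and would need tuning on real data.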
Keywords :
computational linguistics; document handling; grammars; language translation; learning (artificial intelligence); natural languages; program compilers; support vector machines; vector quantisation; VQ-based written language identification; codebook; computational algorithm; machine learning; roman-typed script; template vector; vector quantization; Application software; Australia; Classification algorithms; Encoding; Frequency; Humans; Information technology; Machine learning; Vector quantization; Vocabulary;
Conference_Title :
Signal Processing and Its Applications, 2003. Proceedings. Seventh International Symposium on
Print_ISBN :
0-7803-7946-2
DOI :
10.1109/ISSPA.2003.1224752