Regional Information Center for Science and Technology

DocumentCode :

1941607

Title :

VQ-based written language identification

Author :

Pham, Tuan ; Tran, Dat

Author_Institution :

Sch. of Comput. & Inf. Technol., Griffith Univ., Brisbane, Qld., Australia

Volume :

fYear :

2003

fDate :

1-4 July 2003

Firstpage :

513

Abstract :

Humans can recognize different written languages by their grammars and vocabularies, whereas computers see everything as numbers. We present a computational algorithm for machine classification of written languages using vector quantization. For a language document, each word is converted to a sequence of numbers that forms a vector of numerical values according to its characters. This collection of vectors is then represented by a codebook that contains a number of template vectors for classification. The proposed method is more effective for machine learning than the n-gram-based method, which has been widely used for written language identification. Experimental results on classifying a set of five closely related roman-typed scripts show the promise of the proposed method.
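
The pipeline described in the abstract (words to numeric vectors, one codebook per language, classification by codebook match) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not give the character encoding, the vector length, the codebook size, or the training procedure. The letter-index encoding, the zero padding to WORD_LEN, the plain k-means training, and all function names below are assumptions made for illustration only, and the sample sentences are toy data, not the paper's corpus.

```python
import random

WORD_LEN = 8  # assumed fixed vector length; shorter words are zero-padded


def word_to_vector(word, length=WORD_LEN):
    """Assumed encoding: a=1 ... z=26, any other character 0, padded with 0."""
    codes = [(ord(c) - ord("a") + 1) if "a" <= c <= "z" else 0
             for c in word.lower()[:length]]
    return codes + [0] * (length - len(codes))


def text_to_vectors(text):
    """Turn a document into one vector per whitespace-separated word."""
    return [word_to_vector(w) for w in text.split()]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_codebook(vectors, size=4, iters=20, seed=0):
    """Build a codebook of template vectors with plain k-means (an assumption)."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, min(size, len(vectors)))]
    for _ in range(iters):
        clusters = [[] for _ in codebook]
        for v in vectors:
            nearest = min(range(len(codebook)),
                          key=lambda i: sq_dist(v, codebook[i]))
            clusters[nearest].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old template if no vector was assigned to it
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def distortion(vectors, codebook):
    """Average squared distance from each vector to its nearest template."""
    if not vectors:
        raise ValueError("document contains no words")
    return sum(min(sq_dist(v, c) for c in codebook) for v in vectors) / len(vectors)


def identify(text, codebooks):
    """Return the language whose codebook gives the lowest average distortion."""
    vectors = text_to_vectors(text)
    return min(codebooks, key=lambda lang: distortion(vectors, codebooks[lang]))


if __name__ == "__main__":
    # Toy training samples, far too small for a meaningful classifier.
    samples = {
        "english": "the cat and the dog with the house",
        "german": "der hund und die katze mit dem haus",
    }
    codebooks = {lang: train_codebook(text_to_vectors(text))
                 for lang, text in samples.items()}
    print(identify("the dog and the cat", codebooks))
```

Choosing the language with the lowest average distortion is the usual decision rule for VQ classifiers; whether the paper uses exactly this rule is not stated in the record.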

Keywords :

computational linguistics; document handling; grammars; language translation; learning (artificial intelligence); natural languages; program compilers; support vector machines; vector quantisation; VQ-based written language identification; codebook; computational algorithm; machine learning; roman-typed script; template vector; vector quantization; Application software; Australia; Classification algorithms; Encoding; Frequency; Humans; Information technology; Machine learning; Vector quantization; Vocabulary;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Signal Processing and Its Applications, 2003. Proceedings. Seventh International Symposium on

Print_ISBN :

0-7803-7946-2

Type :

conf

DOI :

10.1109/ISSPA.2003.1224752

Filename :

1224752

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1941607