DocumentCode :
2753695
Title :
A Comparative Study on Vietnamese Text Classification Methods
Author :
Hoang, Vu Cong Duy ; Dinh, Dien ; Le Nguyen, Nguyen ; Ngo, Hung Quoc
Author_Institution :
Coll. of Natural Sci., Vietnam Nat. Univ., Ho Chi Minh City
fYear :
2007
fDate :
5-9 March 2007
Firstpage :
267
Lastpage :
273
Abstract :
Text classification is the problem of automatically assigning text passages (or documents) to predefined categories (or topics). Whereas a wide range of methods has been applied to English text classification, relatively few studies have addressed Vietnamese text classification. Using a Vietnamese news corpus, we evaluate two widely used approaches to the Vietnamese text classification problem: Bag of Words (BOW) and statistical N-gram language modeling (N-Gram). We show that these approaches achieve an average accuracy of more than 95%, with an average classification time of 79 minutes for about 14,000 documents (roughly 3 documents per second). We also analyze the advantages and disadvantages of each approach to identify the best method for specific circumstances.
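The two approaches named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the classifier choices (multinomial Naive Bayes for BOW, one add-one-smoothed bigram model per category for N-Gram), the class names `BowNaiveBayes` and `NgramLmClassifier`, and the four toy documents are all assumptions made for illustration. Whitespace splitting yields Vietnamese syllables rather than segmented words, which is a further simplification.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace tokens (syllables), no word segmentation.
    return text.lower().split()


class BowNaiveBayes:
    """Bag of words: token order is discarded, only counts are kept."""

    def fit(self, docs, labels):
        self.doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        features = Counter(tokenize(text))

        def score(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            s = math.log(self.doc_counts[label] / total_docs)  # class prior
            for word, n in features.items():
                s += n * math.log((counts[word] + 1) / denom)  # add-one smoothing
            return s

        return max(self.doc_counts, key=score)


class NgramLmClassifier:
    """One bigram language model per category; the category whose model
    assigns the text the highest likelihood wins (no class prior here)."""

    def fit(self, docs, labels):
        self.bigrams = defaultdict(Counter)   # label -> (prev, word) counts
        self.contexts = defaultdict(Counter)  # label -> prev counts
        self.vocab = set()
        for text, label in zip(docs, labels):
            tokens = ["<s>"] + tokenize(text)
            self.vocab.update(tokens)
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[label][(prev, word)] += 1
                self.contexts[label][prev] += 1
        return self

    def predict(self, text):
        tokens = ["<s>"] + tokenize(text)
        v = len(self.vocab)

        def score(label):
            return sum(
                math.log((self.bigrams[label][(p, w)] + 1)
                         / (self.contexts[label][p] + v))
                for p, w in zip(tokens, tokens[1:])
            )

        return max(self.bigrams, key=score)


# Hypothetical toy corpus, not taken from the paper's news corpus.
docs = [
    "đội bóng thắng trận đấu",
    "cầu thủ ghi bàn thắng",
    "giá vàng tăng mạnh",
    "thị trường chứng khoán giảm",
]
labels = ["sports", "sports", "economy", "economy"]

bow = BowNaiveBayes().fit(docs, labels)
lm = NgramLmClassifier().fit(docs, labels)
print(bow.predict("cầu thủ ghi bàn"), lm.predict("giá vàng giảm"))
```

The BOW model ignores token order, while the bigram model scores each token given its predecessor, which is the basic difference between the two families compared in the paper.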
Keywords :
classification; natural languages; statistical analysis; text analysis; Vietnamese news corpus; Vietnamese text passage classification; bag-of-words; predefined categorisation; statistical n-gram language modeling; Cities and towns; Educational institutions; Feature extraction; Information technology; Labeling; Natural languages; Resists; Support vector machine classification; Support vector machines; Text categorization; feature extraction; feature selection; k-nearest neighbours; language modeling; naïve bayes; support vector machines; text categorization; text classification;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Research, Innovation and Vision for the Future, 2007 IEEE International Conference on
Conference_Location :
Hanoi
Print_ISBN :
1-4244-0694-3
Type :
conf
DOI :
10.1109/RIVF.2007.369167
Filename :
4223084