DocumentCode :
3298376
Title :
English and Taiwanese text categorization using N-gram based on Vector Space Model
Author :
Suzuki, Makoto ; Yamagishi, Naohide ; Tsai, Yi-Ching ; Ishida, Takashi ; Goto, Masayuki
Author_Institution :
Fac. of Inf. Sci., Shonan Inst. of Technol., Fujisawa, Japan
fYear :
2010
fDate :
17-20 Oct. 2010
Firstpage :
106
Lastpage :
111
Abstract :
In this paper, we present a new mathematical model based on a “Vector Space Model” and consider its implications. We evaluate the proposed method, FRAM, in several experiments that classify newspaper articles from the English Reuters-21578 data set, a benchmark data set for automatic text categorization, and from the Taiwanese China Times 2005 data set. FRAM is shown to have good classification accuracy: the micro-averaged F-measure of the proposed method is 94.5% for English, but only 78.0% for Taiwanese. Although the proposed method is language-independent and provides a new perspective, improving classification accuracy for Taiwanese remains future work.
Keywords :
natural language processing; pattern classification; text analysis; English Reuters-21578 data set; English text categorization; N-gram; Taiwanese China Times 2005 data set; Taiwanese classification accuracy; Taiwanese text categorization; automatic text categorization; language-independent; mathematical model; microaveraged F-measure; newspaper articles; vector space model; Accuracy; Computers; Feature extraction; Mathematical model; Nonvolatile memory; Text categorization; Training; N-gram; classification; newspaper; text mining;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Information Theory and its Applications (ISITA), 2010 International Symposium on
Conference_Location :
Taichung
Print_ISBN :
978-1-4244-6016-8
Electronic_ISBN :
978-1-4244-6017-5
Type :
conf
DOI :
10.1109/ISITA.2010.5649453
Filename :
5649453