DocumentCode
3298376
Title
English and Taiwanese text categorization using N-gram based on Vector Space Model
Author
Suzuki, Makoto ; Yamagishi, Naohide ; Tsai, Yi-Ching ; Ishida, Takashi ; Goto, Masayuki
Author_Institution
Fac. of Inf. Sci., Shonan Inst. of Technol., Fujisawa, Japan
fYear
2010
fDate
17-20 Oct. 2010
Firstpage
106
Lastpage
111
Abstract
In this paper, we present a new mathematical model based on the vector space model and consider its implications. We evaluate the proposed method in several experiments that classify newspaper articles from the English Reuters-21578 data set, a benchmark for automatic text categorization, and from the Taiwanese China Times 2005 data set. The results show that FRAM has good classification accuracy: the micro-averaged F-measure of the proposed method is 94.5% for English. For Taiwanese, however, it is 78.0%. Although the proposed method is language-independent and provides a new perspective, improving classification accuracy for Taiwanese remains future work.
Keywords
natural language processing; pattern classification; text analysis; English Reuters-21578 data set; English text categorization; N-gram; Taiwanese China Times 2005 data set; Taiwanese classification accuracy; Taiwanese text categorization; automatic text categorization; language-independent; mathematical model; microaveraged F-measure; newspaper articles; vector space model; Accuracy; Computers; Feature extraction; Mathematical model; Nonvolatile memory; Text categorization; Training; N-gram; classification; newspaper; text mining;
fLanguage
English
Publisher
IEEE
Conference_Titel
Information Theory and its Applications (ISITA), 2010 International Symposium on
Conference_Location
Taichung
Print_ISBN
978-1-4244-6016-8
Electronic_ISBN
978-1-4244-6017-5
Type
conf
DOI
10.1109/ISITA.2010.5649453
Filename
5649453