English and Taiwanese text categorization using N-gram based on Vector Space Model

Author

Suzuki, Makoto ; Yamagishi, Naohide ; Tsai, Yi-Ching ; Ishida, Takashi ; Goto, Masayuki

Author_Institution

Fac. of Inf. Sci., Shonan Inst. of Technol., Fujisawa, Japan

fYear

2010

fDate

17-20 Oct. 2010

Firstpage

106

Lastpage

111

Abstract

In this paper, we present a new mathematical model based on the vector space model and consider its implications. We evaluate the proposed method (FRAM) in several experiments that classify newspaper articles from two collections: the English Reuters-21578 data set, a benchmark for automatic text categorization, and the Taiwanese China Times 2005 data set. The results show that FRAM has good classification accuracy: the micro-averaged F-measure is 94.5% for English. For Taiwanese, however, it is 78.0%. Although the proposed method is language-independent and provides a new perspective, improving its classification accuracy for Taiwanese remains future work.
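
The abstract names three ingredients (N-grams, a vector space model, and the micro-averaged F-measure) without showing how they fit together. The sketch below is one generic way to combine them and is not the authors' FRAM method, whose details the abstract does not give. It assumes character bigrams as features, summed counts as category vectors, and cosine similarity for assignment. All function names and the toy training sentences are invented for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=2):
    """Count overlapping character n-grams; needs no word segmentation."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, category_vectors, n=2):
    """Return the category whose n-gram vector is closest to the text."""
    vec = char_ngrams(text, n)
    return max(category_vectors, key=lambda c: cosine(vec, category_vectors[c]))


def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F-measure: pool TP/FP/FN over all documents first."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        tp += len(truth & pred)
        fp += len(pred - truth)
        fn += len(truth - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy training data, invented for illustration only (not from Reuters-21578).
training = {
    "grain": ["wheat and corn exports rose", "corn harvest estimates"],
    "crude": ["crude oil prices fell", "opec oil output quota"],
}

# One vector per category: the summed n-gram counts of its training documents.
category_vectors = {
    label: sum((char_ngrams(doc) for doc in docs), Counter())
    for label, docs in training.items()
}

print(classify("oil prices", category_vectors))
```

Because the features are character N-grams rather than words, the same code runs unchanged on text without spaces between words, which is the sense in which an approach like this is language-independent.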

Keywords

natural language processing; pattern classification; text analysis; English Reuters-21578 data set; English text categorization; N-gram; Taiwanese China Times 2005 data set; Taiwanese classification accuracy; Taiwanese text categorization; automatic text categorization; language-independent; mathematical model; micro-averaged F-measure; newspaper articles; vector space model; Accuracy; Computers; Feature extraction; Mathematical model; Nonvolatile memory; Text categorization; Training; N-gram; classification; newspaper; text mining

fLanguage

English

Publisher

IEEE

Conference_Titel

Information Theory and its Applications (ISITA), 2010 International Symposium on

Conference_Location

Taichung

Print_ISBN

978-1-4244-6016-8

Electronic_ISBN

978-1-4244-6017-5

Type

conf

DOI

10.1109/ISITA.2010.5649453

Filename

5649453