• DocumentCode
    3298376
  • Title
    English and Taiwanese text categorization using N-gram based on Vector Space Model
  • Author
    Suzuki, Makoto ; Yamagishi, Naohide ; Tsai, Yi-Ching ; Ishida, Takashi ; Goto, Masayuki
  • Author_Institution
    Fac. of Inf. Sci., Shonan Inst. of Technol., Fujisawa, Japan
  • fYear
    2010
  • fDate
    17-20 Oct. 2010
  • Firstpage
    106
  • Lastpage
    111
  • Abstract
    In this paper, we present a new mathematical model based on a “Vector Space Model” and consider its implications. We evaluate the proposed method experimentally by classifying newspaper articles from the English Reuters-21578 data set, a benchmark for automatic text categorization, and from the Taiwanese China Times 2005 data set. The results show that FRAM achieves good classification accuracy: the micro-averaged F-measure of the proposed method is 94.5% for English, but only 78.0% for Taiwanese. Although the proposed method is language-independent and provides a new perspective, improving classification accuracy for Taiwanese remains future work.
  • Keywords
    natural language processing; pattern classification; text analysis; English Reuters-21578 data set; English text categorization; N-gram; Taiwanese China Times 2005 data set; Taiwanese classification accuracy; Taiwanese text categorization; automatic text categorization; language-independent; mathematical model; microaveraged F-measure; newspaper articles; vector space model; Accuracy; Computers; Feature extraction; Mathematical model; Nonvolatile memory; Text categorization; Training; classification; newspaper; text mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Theory and its Applications (ISITA), 2010 International Symposium on
  • Conference_Location
    Taichung
  • Print_ISBN
    978-1-4244-6016-8
  • Electronic_ISBN
    978-1-4244-6017-5
  • Type
    conf
  • DOI
    10.1109/ISITA.2010.5649453
  • Filename
    5649453