Title :
Refinement of index term set and improvement of classification accuracy on text categorization
Author :
Suzuki, Makoto ; Ishida, Takashi ; Goto, Masayuki
Author_Institution :
Fac. of Eng., Shonan Inst. of Technol., Fujisawa
Abstract :
In our previous paper, we proposed a new classification technique called the frequency ratio accumulation method (FRAM). This is a simple technique that adds up the ratios of term frequency among categories. A distinctive property of FRAM is that it places no limit on the index terms used. We therefore adopt character N-grams as index terms to make use of this property. In the present paper, we refine the database of the index term set using mutual information and frequency ratio, and thereby improve the classification accuracy. The proposed method is then evaluated in several experiments, in which we classify English newspaper articles from Reuters-21578 using FRAM. Reuters-21578 is a benchmark data set for automatic text categorization. The results show that the proposed method achieves good classification accuracy: its macro-averaged F-measure is 92.3% on Reuters-21578. Our method is language-independent, provides a new perspective, and has excellent potential.
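The abstract describes FRAM only in words, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the frequency ratio of a term for a category is that term's count in the category divided by its count over all categories, and that a document is assigned to the category with the largest accumulated ratio. The function names (`char_ngrams`, `train`, `classify`), the choice of N = 3, and the toy training data are hypothetical. The index-term refinement by mutual information and frequency ratio is not shown.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text (assumed index terms)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(labeled_docs, n=3):
    """Build per-term frequency ratios from (text, category) pairs.

    Assumption: ratio(term, category) =
        count of term in category / count of term over all categories.
    """
    freq = defaultdict(Counter)  # freq[term][category] = occurrence count
    for text, category in labeled_docs:
        for term in char_ngrams(text, n):
            freq[term][category] += 1

    ratios = {}
    for term, by_category in freq.items():
        total = sum(by_category.values())
        ratios[term] = {c: k / total for c, k in by_category.items()}
    return ratios


def classify(text, ratios, n=3):
    """Accumulate the ratios of every n-gram in text and return the
    highest-scoring category, or None if no n-gram was seen in training."""
    scores = Counter()
    for term in char_ngrams(text, n):
        for category, ratio in ratios.get(term, {}).items():
            scores[category] += ratio
    return max(scores, key=scores.get) if scores else None


# Toy data, invented for illustration only (not taken from Reuters-21578).
docs = [
    ("wheat corn harvest", "grain"),
    ("oil crude barrel", "crude"),
]
model = train(docs)
label = classify("corn harvest", model)
```

Because the score is a plain sum over whatever N-grams occur in the document, nothing in this sketch depends on word segmentation, which is consistent with the abstract's claim of language independence.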
Keywords :
pattern classification; text analysis; character N-gram; classification accuracy; frequency ratio accumulation method; index term set refinement; text categorization; Electronic mail; Ferroelectric films; Frequency; Mutual information; Natural languages; Nonvolatile memory; Paper technology; Random access memory; Testing; Text categorization;
Conference_Title :
Information Theory and Its Applications, 2008. ISITA 2008. International Symposium on
Conference_Location :
Auckland
Print_ISBN :
978-1-4244-2068-1
Electronic_ISBN :
978-1-4244-2069-8
DOI :
10.1109/ISITA.2008.4895455