Title :
A comparative study on term weighting schemes for text categorization
Author :
Lan, Man ; Sung, Sam-Yuan ; Low, Hwee-Boon ; Tan, Chew-Lim
Author_Institution :
Dept. of Comput. Sci., National Univ. of Singapore, Singapore
fDate :
31 July-4 Aug. 2005
Abstract :
The term weighting scheme, which converts documents into vectors in the term space, is a vital step in automatic text categorization. Previous studies showed that term weighting schemes, rather than the kernel functions of SVMs, dominate performance on the text categorization task. In this paper, we conducted experiments to compare various term weighting schemes with SVM on two widely used benchmark data sets. We also presented a new term weighting scheme, tf.rf, for text categorization. The cross-scheme comparison was performed using McNemar's tests. The controlled experimental results showed that the newly proposed tf.rf scheme is significantly better than the other term weighting schemes. Compared with schemes based on the tf factor alone, the idf factor does not improve, and may even decrease, the term's discriminating power for text categorization. The binary and tf.chi representations significantly underperform the other term weighting schemes.
Keywords :
pattern classification; text analysis; McNemar tests; document conversion; idf factor; term spaces vector; term weighting; text categorization; tf.chi representations; tf.rf; Benchmark testing; Computer science; Drives; Frequency; Kernel; Performance evaluation; Support vector machine classification; Support vector machines; Tellurium; Text categorization;
Conference_Title :
Neural Networks, 2005. IJCNN '05. Proceedings. 2005 IEEE International Joint Conference on
Print_ISBN :
0-7803-9048-2
DOI :
10.1109/IJCNN.2005.1555890