Title :
A global evaluation criterion for feature selection in text categorization using Kullback-Leibler divergence
Author :
Zhen, Zhilong ; Zeng, Xiaoqin ; Wang, Haijuan ; Han, Lixin
Author_Institution :
Coll. of Comput. & Inf., Hohai Univ., Nanjing, China
Abstract :
A major difficulty in text categorization is the extremely high dimensionality of the text feature space. Feature selection is therefore desirable in large-scale text categorization tasks to improve accuracy and efficiency. The χ2 statistic and the simplified χ2 are two effective feature selection methods for text categorization. With either criterion, one computes a local score for a term over each category and usually takes the maximum or the average of these scores as the global term-goodness criterion. However, there is no explicit guidance on whether to choose the maximum or the average; moreover, neither operation reflects the degree of scatter of a term over all categories. In this paper, we propose a new global feature evaluation criterion based on Kullback-Leibler (KL) divergence for choosing informative terms, since KL divergence is widely used to measure the difference between the distributions of two categories. We conduct experiments on the Reuters-21578 corpus with a k-NN classifier to test the performance of the proposed method. The experimental results show that the method enhances text categorization performance: it is comparable to or better than the previous maximum and average criteria on either Macro-F1 or Micro-F1.
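The local-to-global scoring step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard 2x2 contingency-table form of the χ2 statistic, takes the average as a category-prior-weighted mean, and uses a toy corpus of 100 documents in three categories. The function names (`chi2_local`, `global_max_avg`, `kl_divergence`) are hypothetical. The KL helper is only the textbook definition for two discrete distributions; the abstract does not state how the proposed criterion applies it, so that step is not shown.

```python
import math


def chi2_local(a, b, c, d):
    """Chi-square score of a term for one category (2x2 contingency table).

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: term or category covers all or no documents
    return n * (a * d - c * b) ** 2 / denom


def global_max_avg(tables):
    """Combine per-category scores into the two usual global scores.

    tables: one (a, b, c, d) tuple per category, all over the same corpus.
    Returns (maximum, prior-weighted average) of the local scores.
    """
    scores = [chi2_local(*t) for t in tables]
    n = sum(tables[0])
    priors = [(a + c) / n for a, _b, c, _d in tables]  # estimated P(category)
    return max(scores), sum(p * s for p, s in zip(priors, scores))


def kl_divergence(p, q):
    """Textbook KL divergence D(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy data (assumed): category sizes 50, 30, 20; the term occurs in
# 40, 5 and 5 documents of each category respectively.
tables = [(40, 10, 10, 40), (5, 45, 25, 25), (5, 45, 15, 35)]
score_max, score_avg = global_max_avg(tables)
```

On this toy data the two global scores differ noticeably (the maximum is 36.0, the weighted average about 24.96), which illustrates why the choice between them matters and why neither captures how the term is spread across categories.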
Keywords :
feature extraction; pattern classification; text analysis; Kullback-Leibler divergence; Macro-F1; Micro-F1; feature selection; global evaluation criterion; k-NN classifier; text categorization; text feature space; Classification algorithms; Machine learning; Probability; Support vector machines; chi-square statistic
Conference_Title :
Soft Computing and Pattern Recognition (SoCPaR), 2011 International Conference of
Conference_Location :
Dalian
Print_ISBN :
978-1-4577-1195-4
DOI :
10.1109/SoCPaR.2011.6089284