Title :
Comparative analysis on feature selection based Bayesian text classification
Author :
Guang Yang ; Zhong-Yi Lin ; Yu-Xin Chang ; Lei Wang ; Jin-Kun Tian
Author_Institution :
Run Technol. Co., Ltd., Beijing, China
Abstract :
Feature selection is an important preprocessing step in classification and regression learning. Many feature selection algorithms have been proposed that use different information criteria based on mutual information. However, no comparative study has analysed the effectiveness of these methods within a specific application framework. In this paper, we select six feature selection algorithms, i.e., RelFss, MIFS-U, FCBF, CMIM, mRMR, and mMIFS-U, and compare their reduction capabilities and classification performance when applied to naive Bayesian text classification. As experimental data, we collect a large number of documents belonging to ten different domains from a Chinese news Web site (www.people.com.cn), where each document contains at least 1,000 Chinese characters. The experimental results show that the naive Bayesian classifier with the features selected by mRMR obtains the highest classification accuracy. These conclusions provide some guidelines for feature selection in text classification applications.
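The following is a minimal sketch of the kind of pipeline the abstract describes: mutual-information-based feature selection followed by a naive Bayesian classifier. It is not the authors' implementation. It assumes binary term-presence features, the common "relevance minus mean redundancy" form of mRMR, and a Bernoulli naive Bayes model with Laplace smoothing; the function names (mutual_information, mrmr_select, train_bernoulli_nb, predict) and the eight-document toy corpus are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Sum of p(x,y) * log(p(x,y) / (p(x) * p(y))) over observed pairs.
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def mrmr_select(columns, labels, k):
    """Greedy mRMR (difference form, an assumption of this sketch):
    score(f) = I(f; label) - mean over selected s of I(f; s)."""
    relevance = {f: mutual_information(col, labels) for f, col in columns.items()}
    selected, remaining = [], set(columns)
    while remaining and len(selected) < k:
        def score(f):
            if not selected:
                return relevance[f]
            redundancy = sum(mutual_information(columns[f], columns[s])
                             for s in selected) / len(selected)
            return relevance[f] - redundancy
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


def train_bernoulli_nb(columns, labels, features):
    """Bernoulli naive Bayes over the chosen binary features, Laplace-smoothed."""
    model = {}
    for c in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == c]
        prior = len(idx) / len(labels)
        # P(term present | class), smoothed so it is never exactly 0 or 1.
        cond = {f: (sum(columns[f][i] for i in idx) + 1) / (len(idx) + 2)
                for f in features}
        model[c] = (prior, cond)
    return model


def predict(model, doc):
    """doc maps feature name -> 0/1; returns the class with the highest log-posterior."""
    def log_posterior(c):
        prior, cond = model[c]
        return math.log(prior) + sum(
            math.log(p if doc.get(f, 0) else 1.0 - p) for f, p in cond.items())
    return max(model, key=log_posterior)


# Hypothetical toy corpus: 4 "sports" and 4 "finance" documents, binary term presence.
labels = ["sports"] * 4 + ["finance"] * 4
columns = {
    "goal":  [1, 1, 1, 0, 0, 0, 0, 0],
    "match": [1, 1, 0, 0, 0, 0, 0, 0],  # overlaps heavily with "goal"
    "coach": [0, 0, 1, 1, 0, 0, 0, 0],  # covers a document "goal" misses
    "the":   [1, 1, 1, 1, 1, 1, 1, 1],  # carries no class information
}

selected = mrmr_select(columns, labels, 2)
model = train_bernoulli_nb(columns, labels, selected)
```

On this toy data the greedy selection picks "goal" first and then prefers "coach" over "match": both have the same mutual information with the label, but "match" is far more redundant with "goal". A plain relevance ranking would not distinguish the two, which is the kind of difference a comparison of selection criteria is meant to expose.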
Keywords :
belief networks; pattern classification; text analysis; CMIM algorithm; Chinese news Web site; FCBF algorithm; MIFS-U algorithm; RelFss algorithm; classification accuracy; classification performance; feature selection algorithms; information criteria; mMIFS-U algorithm; mRMR algorithm; mutual information; naive Bayesian based text classification; regression learning; feature selection; naive Bayesian classifier; text classification
Conference_Title :
Computer Science and Network Technology (ICCSNT), 2012 2nd International Conference on
Conference_Location :
Changchun
Print_ISBN :
978-1-4673-2963-7
DOI :
10.1109/ICCSNT.2012.6526137