DocumentCode :
585847
Title :
A comparative study on feature selection in Chinese Spam Filtering
Author :
Xu, Yan
Author_Institution :
Beijing Language & Culture Univ., Beijing, China
fYear :
2012
fDate :
17-19 Oct. 2012
Firstpage :
1
Lastpage :
6
Abstract :
Feature selection plays an important role in spam filtering. Automatic feature selection methods such as document frequency thresholding (DF) and information gain (IG) are commonly applied in spam filtering. Spam filtering can also be seen as a special two-class text categorization (TC) problem. Many existing experiments show that IG is one of the most effective methods for text categorization. However, which method is most effective for spam filtering? To our knowledge, there has been no systematic study of these feature selection methods for spam filtering. This paper is a comparative study of feature selection methods in spam filtering. The focus is on aggressive dimensionality reduction. We explore two classifiers (Naïve Bayes and SVM) and run our experiments on a Chinese-spam collection. Six methods were evaluated: term selection based on document frequency (DF), information gain (IG), the χ² feature selection method, expected cross entropy (ECE), the weight of evidence for text (WET), and odds ratio (ODD). We found ODD and WET most effective in our experiments. In contrast, IG and χ² had relatively poor performance due to their bias towards rare terms and their sensitivity to probability estimation errors.
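As an illustration of the kind of term scoring the abstract refers to, the following is a minimal sketch of four of the six named criteria (DF, IG, χ², and odds ratio) computed from a 2×2 table of document counts for a single term. It uses common textbook definitions, which may differ from the exact formulas in the paper; the function name `term_scores`, the count names `a`, `b`, `c`, `d`, and the smoothing constant `eps` are assumptions made for this example. ECE and WET are omitted.

```python
import math


def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def term_scores(a, b, c, d, eps=0.5):
    """Score one term from a 2x2 document-count table.

    a: spam docs containing the term    b: ham docs containing the term
    c: spam docs without the term       d: ham docs without the term
    eps: additive smoothing for the odds ratio (an assumption of this sketch)
    """
    n = a + b + c + d

    # Document frequency: number of documents that contain the term.
    df = a + b

    # Chi-square statistic for a 2x2 contingency table.
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0

    # Information gain: class entropy minus the expected class entropy
    # after observing whether the term is present or absent.
    h_class = entropy([(a + c) / n, (b + d) / n])
    h_given_term = 0.0
    for spam, ham in ((a, b), (c, d)):
        total = spam + ham
        if total:
            h_given_term += total / n * entropy([spam / total, ham / total])
    ig = h_class - h_given_term

    # Log odds ratio of the term appearing in spam versus ham,
    # smoothed so that zero counts do not produce log(0) or division by zero.
    p_spam = (a + eps) / (a + c + 2 * eps)
    p_ham = (b + eps) / (b + d + 2 * eps)
    odds_ratio = math.log(p_spam * (1 - p_ham) / ((1 - p_spam) * p_ham))

    return {"df": df, "ig": ig, "chi2": chi2, "odds_ratio": odds_ratio}
```

Aggressive dimensionality reduction would then amount to ranking all vocabulary terms by one of these scores and keeping only the top few; the cut-off is a tuning choice not specified here.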
Keywords :
belief networks; natural language processing; probability; support vector machines; text analysis; unsolicited e-mail; Chinese spam filtering; Chinese-spam collection; DF; ECE; IG; Naïve Bayes classifier; SVM classifier; TC problem; WET; aggressive dimensionality reduction; automatic feature selection methods; document frequency thresholding; expected cross entropy; information gain; odds ratio; probability estimation errors; special two-class text categorization problem; term selection; weight of evidence for text; Accuracy; Entropy; Filtering; Machine learning; Text categorization; Unsolicited electronic mail; DF; artificial intelligence; feature selection; spam filtering; text classification;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Application of Information and Communication Technologies (AICT), 2012 6th International Conference on
Conference_Location :
Tbilisi
Print_ISBN :
978-1-4673-1739-9
Type :
conf
DOI :
10.1109/ICAICT.2012.6398481
Filename :
6398481