Title :
Spam Feature Selection Based on the Improved Mutual Information Algorithm
Author :
Liang Ting ; Yu Qingsong
Author_Institution :
Comput. Center, East China Normal Univ., Shanghai, China
Abstract :
Content-based spam filtering technologies generally use feature selection algorithms for mail classification. Building on the mutual information feature selection algorithm, this paper proposes two improved methods: mutual information with frequency (MIf), which introduces a word frequency factor, and mutual information with average frequency (MIaf), which introduces a word average frequency factor. Simulation experiments are conducted on an English corpus (PU1's lemm_stop) and a Chinese corpus (the CCERT email data set): feature subsets are extracted with the improved algorithms, and the mails are classified with the Naïve Bayes algorithm. The experimental results show that the improved mutual information algorithms select better feature subsets and improve mail classification performance.
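The idea of adding a word frequency factor to mutual information can be illustrated with a minimal Python sketch. The baseline score is standard pointwise mutual information estimated from document counts. The frequency weighting (the term's share of all tokens in the target class) is an assumption made for illustration only: the abstract does not give the MIf or MIaf formulas, so this is not the paper's method. The names `mi_scores` and `top_k` are hypothetical.

```python
import math
from collections import Counter


def mi_scores(docs, labels, target="spam", use_frequency=False):
    """Score each term for the target class.

    docs: list of token lists; labels: parallel list of class labels.
    Baseline: pointwise MI, log(P(t, c) / (P(t) * P(c))), from document counts.
    use_frequency=True multiplies the score by the term's share of tokens in
    the target class. That weighting is an illustrative assumption, not the
    paper's MIf formula.
    """
    n_docs = len(docs)
    n_target = sum(1 for y in labels if y == target)
    df_all = Counter()     # number of documents containing the term
    df_target = Counter()  # number of target-class documents containing it
    tf_target = Counter()  # term occurrences within the target class
    for tokens, y in zip(docs, labels):
        for t in set(tokens):
            df_all[t] += 1
            if y == target:
                df_target[t] += 1
        if y == target:
            tf_target.update(tokens)
    total_target_tokens = sum(tf_target.values())

    scores = {}
    for t, df in df_all.items():
        a = df_target[t]
        if a == 0:
            continue  # term never occurs in the target class; skip it
        mi = math.log(a * n_docs / (df * n_target))
        if use_frequency:
            mi *= tf_target[t] / total_target_tokens
        scores[t] = mi
    return scores


def top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]
```

Plain mutual information depends only on which documents contain a term, so a word seen once in one spam message can score as high as a word repeated across many spam messages. A frequency factor is one way to separate the two cases, and the selected top-k terms would then be passed to a classifier such as Naïve Bayes.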
Keywords :
Bayes methods; content-based retrieval; e-mail filters; feature extraction; unsolicited e-mail; Chinese corpus CCERT email data set; English corpus; Naive Bayes algorithm; content-based spam filtering technologies; feature subsets; improved mutual information algorithm; mail classification effects; method with frequency; mutual information feature selection algorithm; spam feature selection; word average frequency factor; Algorithm design and analysis; Classification algorithms; Educational institutions; Filtering; Mutual information; Postal services; Probability; Spam; Mutual Information; Word Frequencies; Feature Selection;
Conference_Title :
Multimedia Information Networking and Security (MINES), 2012 Fourth International Conference on
Conference_Location :
Nanjing
Print_ISBN :
978-1-4673-3093-0
DOI :
10.1109/MINES.2012.203