DocumentCode
3009740
Title
GA-based feature subset selection in a spam/non-spam detection system
Author
Behjat, Amir Rajabi ; Mustapha, Aida ; Nezamabadi-pour, Hossein ; Sulaiman, Md Nasir ; Mustapha, Norwati
Author_Institution
Fac. of Comput. Sci. & Inf. Technol., Univ. Putra Malaysia, Serdang, Malaysia
fYear
2012
fDate
3-5 July 2012
Firstpage
675
Lastpage
679
Abstract
Spam has created a significant security problem for computer users everywhere. Spammers use deceptive tricks to disguise the parts of a message that could be used to identify it as spam. Moreover, sending junk mail costs a spammer little in money or bandwidth, even for batches of more than one hundred emails. From the feature selection perspective, one specific problem that decreases the accuracy of spam/non-spam email classification is high data dimensionality. Reducing dimensionality therefore means decreasing the number of irrelevant features. In this paper, a genetic algorithm (GA) is applied during feature selection in an effort to decrease the number of useless features in a collection of high-dimensional email bodies and subjects. Next, a Multi-Layer Perceptron (MLP) is employed to classify emails using the features selected by the GA. Using the LingSpam benchmark corpora as the dataset, the experimental results show that a GA feature selector with the MLP classifier not only decreases the data dimensionality but also increases the spam detection rate compared with other classifiers such as SVM and Naïve Bayes.
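The GA-based feature subset selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the GA operators, parameters, or fitness function, so the tournament selection, one-point crossover, bit-flip mutation, elitism, and size penalty below are all assumptions. A simple nearest-centroid classifier stands in for the MLP to keep the example dependency-free, and the toy data is synthetic rather than LingSpam. All function names are hypothetical.

```python
import random

def nearest_centroid_accuracy(X, y, mask):
    """Stand-in for the MLP: training accuracy of a nearest-centroid
    classifier restricted to the features where mask[j] == 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, t in zip(X, y):
        pred = min(
            centroids,
            key=lambda c: sum((x[j] - m) ** 2 for j, m in zip(cols, centroids[c])),
        )
        correct += pred == t
    return correct / len(X)

def fitness(X, y, mask, penalty=0.01):
    """Assumed fitness: accuracy minus a small cost per selected feature."""
    return nearest_centroid_accuracy(X, y, mask) - penalty * sum(mask)

def ga_select(X, y, pop_size=20, generations=30, mutation_rate=0.05, seed=0):
    """Evolve binary masks over the feature columns; return the best mask."""
    rng = random.Random(seed)
    n = len(X[0])
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda m: fitness(X, y, m), reverse=True)
        next_pop = scored[:2]  # elitism: carry over the two best masks
        while len(next_pop) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            p1 = max(rng.sample(scored, 3), key=lambda m: fitness(X, y, m))
            p2 = max(rng.sample(scored, 3), key=lambda m: fitness(X, y, m))
            cut = rng.randint(1, n - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with probability mutation_rate per bit.
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=lambda m: fitness(X, y, m))

def make_toy_data(n_samples=60, n_features=8, seed=1):
    """Synthetic data: only feature 0 carries class signal, the rest is noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.random() for _ in range(n_features)]
        row[0] = label + 0.1 * rng.random()
        X.append(row)
        y.append(label)
    return X, y

if __name__ == "__main__":
    X, y = make_toy_data()
    best = ga_select(X, y)
    print("selected mask:", best, "fitness:", fitness(X, y, best))
```

In a setting closer to the paper, each column would be a term feature extracted from email bodies and subjects, and the fitness would be the validation accuracy of an MLP trained on the selected columns.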
Keywords
genetic algorithms; multilayer perceptrons; telecommunication security; unsolicited e-mail; GA feature selector; LingSpam benchmark corpora; MLP classifier; SVM; data dimensionality; email classification; feature selection perspective; feature subset selection; genetic algorithm; high dimensional email body; junk mails; multilayer perceptron; naive Bayes; nonspam detection system; security problem; spam detection rate; Accuracy; Electronic mail; Feature extraction; Genetic algorithms; Support vector machine classification; Training; Feature selection; Genetic algorithm; MLP; Spam detection;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer and Communication Engineering (ICCCE), 2012 International Conference on
Conference_Location
Kuala Lumpur
Print_ISBN
978-1-4673-0478-8
Type
conf
DOI
10.1109/ICCCE.2012.6271302
Filename
6271302