Title :
Adversarial Spam Detection Using the Randomized Hough Transform-Support Vector Machine
Author :
Debarr, Dave ; Hao Sun ; Wechsler, Harry
Author_Institution :
Comput. Sci. Dept., George Mason Univ., Fairfax, VA, USA
Abstract :
In public e-mail systems, it is possible to solicit annotation help from users to train spam detection models. For example, we can occasionally ask a selected user to annotate whether a randomly selected message destined for their inbox is spam or not spam. Unfortunately, the solicited user may be an internal threat with malicious intent. Like an adversary, such a user may want to introduce label noise: to confuse the spam classifier into believing a spam message is not spam (to ensure delivery of similar messages), or into believing a non-spam message is spam (to prevent delivery of similar messages). Inspired by the Randomized Hough Transform (RHT), a set of Support Vector Machines (SVMs) is trained on randomly chosen data subsets, and the SVMs vote to identify training examples that have been mislabeled. The labels of messages that on average appear on the wrong side of the decision boundary are flipped, and a final SVM model is trained using the modified labels. Two data sets are used to evaluate the proposed RHT-SVM method: the TREC 2007 Spam Track data and the CEAS 2008 Spam data. To preserve the time-ordered nature of the data stream, for each data set the first 10% of the messages are used for training and the remaining 90% are used for evaluation. Separate adversarial experiments are conducted for flipping spam labels and non-spam labels. For 10 iterations, labels are flipped for a randomly selected subset of 5% of the training data and the final RHT-SVM is evaluated on the test set. The performance of RHT-SVM is compared to that of the state-of-the-art Reject On Negative Impact (RONI) algorithm. RHT-SVM shows an average 9.3% relative increase in the F-measure compared to RONI (99.0% versus 90.6%), as well as significant improvements in other evaluation metrics. The flip sensitivity for RHT-SVM is 95.9% and the flip specificity is 99.0%. The RHT-SVM experiments also take over 90% less time to complete than the RONI experiments (20 minutes per experiment instead of 360 minutes).
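The voting and relabeling procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `train_linear_svm`, `rht_relabel`, `flip_rates`, and `make_demo` are hypothetical, the SVM is a bias-free linear model fit by batch sub-gradient descent (a stand-in for whatever SVM solver and kernel the paper uses), the ensemble size and subset fraction are arbitrary, and the data is a synthetic two-cluster toy set rather than TREC or CEAS mail. The reading of "flip sensitivity" and "flip specificity" in `flip_rates` is also an assumption.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_linear_svm(X, y, lam=0.1, epochs=300):
    """Bias-free linear soft-margin SVM fit by batch sub-gradient descent
    with step size 1/(lam*t). Assumed stand-in for any SVM trainer."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for t in range(1, epochs + 1):
        eta = 1.0 / (lam * t)
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            if yi * dot(w, xi) < 1.0:  # hinge loss is active for this example
                for j in range(d):
                    grad[j] += yi * xi[j]
        w = [(1.0 - eta * lam) * wj + (eta / n) * gj for wj, gj in zip(w, grad)]
    return w

def rht_relabel(X, y, n_models=11, subset_frac=0.5, seed=0):
    """Train SVMs on random subsets, sum each example's signed margin over
    all models, and flip labels whose average margin is negative."""
    rng = random.Random(seed)
    n = len(X)
    k = max(1, int(subset_frac * n))
    margin_sum = [0.0] * n
    for _ in range(n_models):
        idx = rng.sample(range(n), k)
        w = train_linear_svm([X[i] for i in idx], [y[i] for i in idx])
        for i in range(n):
            margin_sum[i] += y[i] * dot(w, X[i])  # > 0 means "right side"
    flagged = [i for i in range(n) if margin_sum[i] < 0.0]
    cleaned = list(y)
    for i in flagged:
        cleaned[i] = -cleaned[i]
    return cleaned, flagged

def flip_rates(n, flagged, truly_flipped):
    """Assumed definitions: sensitivity = share of mislabeled examples that
    were flagged; specificity = share of clean examples left untouched."""
    flagged, truly_flipped = set(flagged), set(truly_flipped)
    sensitivity = len(flagged & truly_flipped) / len(truly_flipped)
    specificity = 1.0 - len(flagged - truly_flipped) / (n - len(truly_flipped))
    return sensitivity, specificity

def make_demo(seed=1, n_per_class=30, n_flips=3):
    """Two tight, well-separated 2-D clusters with 5% of labels flipped."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (1, -1):
        for _ in range(n_per_class):
            X.append([label * 3.0 + rng.uniform(-0.3, 0.3),
                      label * 3.0 + rng.uniform(-0.3, 0.3)])
            y.append(label)
    noisy = list(y)
    flip_idx = sorted(rng.sample(range(len(y)), n_flips))
    for i in flip_idx:
        noisy[i] = -noisy[i]
    return X, y, noisy, flip_idx

X, y_true, y_noisy, flip_idx = make_demo()
y_clean, flagged = rht_relabel(X, y_noisy)
final_w = train_linear_svm(X, y_clean)  # final model on the modified labels
sens, spec = flip_rates(len(X), flagged, flip_idx)
print("flagged:", flagged, "sensitivity:", sens, "specificity:", spec)
```

Summing signed margins rather than counting hard votes is one way to read "on average appear on the wrong side of the decision boundary"; a majority vote over per-model predictions would be an equally plausible variant.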
Keywords :
Hough transforms; pattern classification; security of data; support vector machines; unsolicited e-mail; CEAS 2008 spam data; RHT-SVM; RONI; TREC 2007 spam track data; adversarial spam detection; internal threat; malicious intent; nonspam labels; nonspam message; public e-mail systems; randomized Hough transform-support vector machine; randomly selected message; reject on negative impact algorithm; spam classifier; spam labels; Kernel; Noise; Support vector machines; Training; Training data; Transforms; Unsolicited electronic mail; Adversarial Label Noise; Adversarial Learning; Spam Detection; Support Vector Machines;
Conference_Title :
Machine Learning and Applications (ICMLA), 2013 12th International Conference on
Conference_Location :
Miami, FL
DOI :
10.1109/ICMLA.2013.61