DocumentCode :
2247630
Title :
Online supervised learning from multi-field documents for email spam filtering
Author :
Liu, Wu-ying ; Wang, Lin ; Wang, Ting
Author_Institution :
Coll. of Comput., Nat. Univ. of Defense Technol., Changsha, China
Volume :
6
fYear :
2010
fDate :
11-14 July 2010
Firstpage :
3335
Lastpage :
3340
Abstract :
Email spam filtering can be treated as an online supervised learning task for binary text classification (TC). Previous statistical TC algorithms typically treat an email as a single plain-text document and ignore the multi-field structure of email documents. This paper investigates that multi-field structure and proposes a multi-field learning (MFL) approach for email spam filtering. The MFL approach divides the complex TC problem for a multi-field document into several sub-problems and solves each one separately. During online learning, each scorer is trained separately on its own text field according to online supervised feedback. During online prediction, the scorers' output scores are combined to predict the new document's category. The MFL framework is general and can combine scorers implemented with any statistical TC algorithm. However, previous TC algorithms often require long training or updating times, which is impractical for large-scale email systems. To address the space and time costs of email spam filtering, a string-frequency index (SFI) binary TC algorithm is proposed; it is based on straightforward conditional probability and has low space and time complexity for both online learning and online prediction. Experimental results on the TREC spam track show that the MFL approach improves the performance of online Bayesian and relaxed online SVM algorithms. In particular, the proposed SFI algorithm achieves state-of-the-art performance at greatly reduced computational cost within the MFL framework.
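The learn-per-field, combine-at-prediction idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's SFI algorithm, whose details this record does not give. The class names `FieldScorer` and `MultiFieldFilter`, the whitespace tokenization, the add-one smoothed per-token estimate of P(spam | token), and the plain averaging of field scores are all assumptions made for illustration.

```python
class FieldScorer:
    """Hypothetical scorer for one text field, updated online.

    Keeps per-class document counts for each token (an assumed stand-in
    for a frequency index; not the paper's SFI data structure).
    """

    def __init__(self):
        self.spam = {}
        self.ham = {}

    def learn(self, text, is_spam):
        # Online update: count each distinct token once per document.
        counts = self.spam if is_spam else self.ham
        for token in set(text.lower().split()):
            counts[token] = counts.get(token, 0) + 1

    def score(self, text):
        # Mean smoothed estimate of P(spam | token); 0.5 means no evidence.
        tokens = set(text.lower().split())
        if not tokens:
            return 0.5
        total = 0.0
        for token in tokens:
            s = self.spam.get(token, 0)
            h = self.ham.get(token, 0)
            total += (s + 1) / (s + h + 2)  # add-one smoothing (assumption)
        return total / len(tokens)


class MultiFieldFilter:
    """Hypothetical multi-field wrapper: one scorer per named field."""

    def __init__(self, fields):
        self.scorers = {name: FieldScorer() for name in fields}

    def predict(self, email):
        # Combine field scores by a plain average (assumed combination rule).
        scores = [s.score(email.get(f, "")) for f, s in self.scorers.items()]
        return sum(scores) / len(scores)

    def learn(self, email, is_spam):
        # Each scorer is trained only on its own field of the message.
        for f, scorer in self.scorers.items():
            scorer.learn(email.get(f, ""), is_spam)
```

In an online setting, `predict` would be called on each incoming message first, and `learn` would be called once the user's feedback label arrives. A score above 0.5 leans toward spam under these assumptions.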
Keywords :
e-mail filters; learning (artificial intelligence); unsolicited e-mail; binary text classification; conditional probability; email spam filtering; multi-field documents; multi-field learning; online supervised learning; string-frequency index; Electronic mail; Feature extraction; Filtering; Filtering algorithms; Prediction algorithms; Silicon; Training; Email Spam Filtering; Multi-field Document; Multi-field Learning; Online Supervised Learning; Text Classification;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Machine Learning and Cybernetics (ICMLC), 2010 International Conference on
Conference_Location :
Qingdao
Print_ISBN :
978-1-4244-6526-2
Type :
conf
DOI :
10.1109/ICMLC.2010.5580676
Filename :
5580676