DocumentCode :
1478340
Title :
Speed Up Statistical Spam Filter by Approximation
Author :
Zhong, Zhenyu ; Li, Kang
Author_Institution :
McAfee Inc., Alpharetta, GA, USA
Volume :
60
Issue :
1
fYear :
2011
Firstpage :
120
Lastpage :
134
Abstract :
Statistical-based Bayesian filters have become a popular and important defense against spam. However, despite their effectiveness, their greater processing overhead can prevent them from scaling well for enterprise-level mail servers. For example, the dictionary lookups that are characteristic of this approach are limited by the memory access rate and are therefore relatively insensitive to increases in CPU speed. We conduct a comprehensive study to address this scaling issue by proposing a series of acceleration techniques that speed up Bayesian filters based on approximate classifications. The core approximation technique uses hash-based lookup and lossy encoding. Lookup approximation is based on the popular Bloom filter data structure with an extension to support value retrieval. Lossy encoding is used to further compress the data structure. While these approximation methods introduce additional errors to a strict Bayesian approach, we show how the errors can be both minimized and biased toward a false negative classification. We demonstrate a 6× speedup over two well-known spam filters (bogofilter and qsf) while achieving an identical false positive rate and a similar false negative rate to the original filters.
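The abstract's combination of hash-based lookup, value retrieval, and lossy encoding can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `ValueBloom`, the 16-level quantization, the `EMPTY` sentinel, and the min-on-insert / max-on-lookup collision rule are all assumptions chosen to show one way such a structure might return a score that never exceeds the stored one (so a token looks no more "spammy" than it was recorded to be).

```python
import hashlib


class ValueBloom:
    """Bloom-filter-like table mapping tokens to a quantized score (illustrative only)."""

    EMPTY = 255  # hypothetical sentinel marking an unused cell

    def __init__(self, num_cells=1 << 16, num_hashes=4, levels=16):
        self.num_cells = num_cells
        self.num_hashes = num_hashes
        self.levels = levels  # number of quantization levels (lossy encoding)
        self.cells = bytearray([self.EMPTY]) * num_cells

    def _indexes(self, token):
        # Derive k cell indexes from a single digest via double hashing.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_cells for i in range(self.num_hashes)]

    def _quantize(self, score):
        # Round down to one of `levels` codes, so the code never overstates the score.
        return min(int(score * self.levels), self.levels - 1)

    def put(self, token, score):
        code = self._quantize(score)
        for i in self._indexes(token):
            # On collision keep the smaller code; EMPTY is larger than any real code.
            if code < self.cells[i]:
                self.cells[i] = code

    def get(self, token):
        codes = [self.cells[i] for i in self._indexes(token)]
        if self.EMPTY in codes:
            return None  # token was definitely never inserted
        # Every cell of an inserted token holds at most its own code,
        # so the maximum is a lower bound on the stored score.
        return max(codes) / self.levels


table = ValueBloom()
table.put("viagra", 0.93)
table.put("meeting", 0.10)
print(table.get("viagra"), table.get("meeting"), table.get("unseen-token"))
```

As with any Bloom filter, a token that was never inserted can occasionally be reported as present when all of its cells happen to be occupied; the one-sided bound in this sketch applies only to tokens that were actually stored.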
Keywords :
Bayes methods; data structures; pattern classification; statistical analysis; unsolicited e-mail; Bloom filter data structure; approximate classifications; dictionary lookups; enterprise level mail servers; hash based lookup; lossy encoding; statistical based Bayesian filters; statistical spam filter; value retrieval; Acceleration; Approximation methods; Bayesian methods; Data structures; Dictionaries; Electronic mail; Encoding; Filters; Information retrieval; Unsolicited electronic mail; Computer systems organization; approximation; bloom filter; information systems applications; miscellaneous; performance attributes; spam
fLanguage :
English
Journal_Title :
Computers, IEEE Transactions on
Publisher :
IEEE
ISSN :
0018-9340
Type :
jour
DOI :
10.1109/TC.2010.92
Filename :
5453351