Title :
A Chinese Anti-Spam Filter Approach Based on Support Vector Machine
Author :
Xiu-li, Pang ; Yu-qiang, Feng ; Wei, Jiang
Author_Institution :
Harbin Inst. of Technol., Harbin
Abstract :
This paper presents an anti-spam filtering approach based on the support vector machine (SVM). First, we adopt a tri-gram language model to perform word segmentation on Chinese email, applying the absolute discount smoothing algorithm to overcome the sparse data problem. Second, the different kinds of factoid words are identified by an automaton, so as to capture the approximate syntactic and semantic usage of factoid words in the anti-spam filtering task. Third, we apply the SVM to filter spam, where emails may be written in a mix of languages, including Chinese and English. Experiments on large-scale mixed-language corpora show that the SVM generalizes better than the alternatives tested, achieving 4.09% higher precision than Naive Bayes (smoothed with the Lidstone algorithm) and 8.18% higher precision than the maximum entropy model.
Keywords :
Bayes methods; information filters; maximum entropy methods; support vector machines; unsolicited e-mail; Chinese anti-spam filter approach; Naive Bayes; SVM; absolute discount smoothing algorithm; automaton machine; factoid words usage; large-scale corpora; maximum entropy model; sparse data problem; support vector machine; tri-gram language model; word segmentation; Boosting; Electronic mail; Entropy; Filtering; Filters; Machine learning; Natural languages; Support vector machine classification; Technology management; anti-spam filter; maximum entropy
Conference_Title :
Management Science and Engineering, 2007. ICMSE 2007. International Conference on
Conference_Location :
Harbin
Print_ISBN :
978-7-88358-080-5
Electronic_ISBN :
978-7-88358-080-5
DOI :
10.1109/ICMSE.2007.4421831
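Example :
The abstract names absolute discount smoothing as the remedy for sparse n-gram counts. The block below is a minimal sketch of that idea, not the authors' implementation: it assumes a bigram model (the paper uses a tri-gram model) interpolated with an unsmoothed unigram distribution, and the function name `train_bigram_absolute_discount`, the `discount` value of 0.5, and the toy token list are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_absolute_discount(tokens, discount=0.5):
    """Return prob(word, prev) using interpolated absolute discounting.

    Assumption: 0 < discount < 1, so every seen bigram keeps a positive count.
    """
    unigrams = Counter(tokens)
    total = sum(unigrams.values())
    bigrams = Counter(zip(tokens, tokens[1:]))
    history_count = Counter(tokens[:-1])   # how often each token acts as a history
    followers = defaultdict(set)           # distinct words observed after each history
    for prev, word in bigrams:
        followers[prev].add(word)

    def prob(word, prev):
        p_unigram = unigrams[word] / total
        c_prev = history_count[prev]
        if c_prev == 0:
            return p_unigram               # unseen history: fall back to the unigram estimate
        # Subtract a fixed discount from every seen bigram count...
        discounted = max(bigrams[(prev, word)] - discount, 0.0) / c_prev
        # ...and redistribute the freed mass according to the unigram distribution.
        backoff_mass = discount * len(followers[prev]) / c_prev
        return discounted + backoff_mass * p_unigram

    return prob


# Toy usage: the bigram ("a", "a") never occurs, yet receives non-zero probability.
toy_tokens = ["a", "b", "a", "c", "a", "b"]
p = train_bigram_absolute_discount(toy_tokens)
print(p("a", "a") > 0)
```

In this sketch a word absent from the training tokens still gets probability zero, because the unigram distribution is left unsmoothed; a practical segmenter would need a further fallback for unknown words.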