DocumentCode :
1782923
Title :
The investigation on the effect of feature vector dimension for spam email detection with a new framework
Author :
Ergin, Semih ; Isik, Sinan
Author_Institution :
Dept. of Electr. & Electron. Eng., Eskisehir Osmangazi Univ., Eskisehir, Turkey
fYear :
2014
fDate :
18-21 June 2014
Firstpage :
1
Lastpage :
4
Abstract :
This study investigates the effect of feature vector dimension on the classification of Turkish e-mails as spam or legitimate. Although hundreds of experimental studies have been carried out, especially for English, which is a non-agglutinative language, the efforts for Turkish, one of the most popular agglutinative languages in the world, can be counted on the fingers of one hand. Therefore, a solution to the Turkish spam e-mail problem is sought that takes the special characteristics of Turkish e-mails into consideration. The developed spam filtering framework has four components: morphological decomposition, feature selection, training, and test phases. A fixed-prefix stemming approach is used to extract the features of an e-mail, and the Mutual Information (MI) method is then applied for feature selection. Decision Tree (DT) and Artificial Neural Network (ANN) classifiers are employed, and the recognition accuracies obtained with them are satisfactory. The highest accuracy rates are 91.08% for ANN and 87.67% for DT when the feature vector dimensions are (150×5) and (75×5), respectively.
Keywords :
decision trees; neural nets; pattern classification; unsolicited e-mail; ANN classifier; DT classifier; English language; MI method; Turkish e-mail classification; Turkish language; artificial neural network; decision tree; electronic mail; feature selection phase; feature vector dimension; fixed-prefix stemming approach; morphological decomposition phase; mutual information; spam email detection; spam filtering framework; test phase; training phase; Accuracy; Artificial neural networks; Feature extraction; Support vector machine classification; Text categorization; Unsolicited electronic mail; Spam; artificial neural networks; decision tree; e-mail; legitimate; mutual information
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Information Systems and Technologies (CISTI), 2014 9th Iberian Conference on
Conference_Location :
Barcelona
Type :
conf
DOI :
10.1109/CISTI.2014.6877092
Filename :
6877092