Regional Information Center for Science and Technology (RICEST) - Phishing detection using traffic behavior, spectral clustering, and random forests

DocumentCode :

2882207

Title :

Phishing detection using traffic behavior, spectral clustering, and random forests

Author :

Debarr, Dave ; Ramanathan, Vignesh ; Wechsler, Harry

Author_Institution :

Comput. Sci. Dept., George Mason Univ., Fairfax, VA, USA

fYear :

2013

fDate :

4-7 June 2013

Firstpage :

Lastpage :

Abstract :

Phishing is an attempt to steal a user's identity. This is typically accomplished by sending an email message to a user, with a link directing the user to a web site used to collect personal information. Phishing detection systems typically rely on content filtering techniques, such as Latent Dirichlet Allocation (LDA), to identify phishing messages. In the case of spear phishing, however, this may be ineffective because messages from a trusted source may contain little content. In order to handle such emerging spear phishing behavior, we propose as a first step the use of Spectral Clustering to analyze messages based on traffic behavior. In particular, Spectral Clustering analyzes the links between URL substrings for web sites found in the message contents. Cluster membership is then used to construct a Random Forest classifier for phishing. Data from the Phishing Email Corpus and the Spam Assassin Email Corpus are used to evaluate this approach. Performance evaluation metrics include the Area Under the receiver operating characteristic Curve (AUC), as well as accuracy, precision, recall, and the (harmonic mean) F measure. Performance of the integrated Spectral Clustering and Random Forest approach is found to provide significant improvements in all the metrics listed, compared to a content filtering technique such as LDA coupled with text message deletion done randomly or in an adaptive fashion using adversarial learning. The Spectral Clustering approach is robust against the absence of content. In particular, we show that Spectral Clustering yields (99.8%, 97.8%) for (AUC, F measure) compared to LDA that yields (94.6%, 89.4%) and (79.6%, 57.9%) when the content of the messages is reduced to 10% of their original size using random and adversarial deletion, respectively. The difference is most striking at low False Positive (FP) rates.
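
For readers unfamiliar with the evaluation metrics named in the abstract, the following is a minimal sketch of how accuracy, precision, recall, and the harmonic-mean F measure are commonly computed from confusion-matrix counts. The function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they are not taken from the paper or its data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F measure from confusion counts.

    tp/fp/fn/tn are true positives, false positives, false negatives, and
    true negatives, with "positive" meaning a message flagged as phishing.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # The F measure is the harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts for illustration only; not results from the paper.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
```

Because the F measure is a harmonic mean, it is pulled toward the lower of precision and recall, so a classifier cannot score well on it by excelling at only one of the two.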

Keywords :

Web sites; computer crime; learning (artificial intelligence); pattern classification; pattern clustering; performance evaluation; random processes; unsolicited e-mail; AUC; URL substrings; Web site; adversarial deletion; adversarial learning; area under the receiver operating characteristic curve; cluster membership; email message; false positive rates; integrated spectral clustering; message contents; performance evaluation metrics; personal information collection; phishing detection systems; phishing email corpus; random deletion; random forest classifier; spam assassin email corpus; spear phishing behavior; text message deletion; traffic behavior; trusted source; Electronic mail; Laplace equations; Training; Vegetation; Web servers; Web sites; Latent Dirichlet Allocation; Link Analysis; Phishing; Spear Phishing; Spectral Clustering;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Intelligence and Security Informatics (ISI), 2013 IEEE International Conference on

Conference_Location :

Seattle, WA

Print_ISBN :

978-1-4673-6214-6

Type :

conf

DOI :

10.1109/ISI.2013.6578788

Filename :

6578788

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2882207