Title :
Examination of data, rule generation and detection of phishing URLs using online logistic regression
Author :
Feroz, Mohammed Nazim ; Mengel, Susan
Author_Institution :
Comput. Sci., Texas Tech Univ., Lubbock, TX, USA
Abstract :
Web services such as online banking, gaming, and social networking have evolved rapidly, as has people's reliance on them for everyday tasks. As a result, a large amount of information is uploaded to the Web daily. The openness of the Web gives criminals opportunities to upload malicious content. Despite extensive research, email-based spam filtering techniques are unable to protect other web services. A countermeasure is therefore needed that generalizes across web services to protect users from phishing hosts. The paper describes an approach that automatically classifies URLs based on their lexical and host-based features. The usability of Mahout is demonstrated for such scalable machine learning problems, and online learning is considered over batch learning. The classifier achieves 93-97% accuracy by detecting a large number of phishing hosts while maintaining a modest false positive rate. The raw data is examined, and the effectiveness of various feature subsets is assessed. The relevance of bigrams is assessed and supported using the chi-squared and information gain attribute evaluation methods.
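The online-learning idea in the abstract can be illustrated with a minimal sketch: logistic regression updated one URL at a time by stochastic gradient descent over hashed character bigrams. This is not the paper's implementation and does not use Mahout's API. The feature choices, the `NUM_BUCKETS` size, the learning rate, and the toy URLs are all assumptions made for illustration, and host-based features are omitted.

```python
import math
from urllib.parse import urlparse

NUM_BUCKETS = 2 ** 12  # hypothetical size of the hashed bigram space


def bucket(token):
    """Deterministic string hash (the built-in hash() is salted per process)."""
    h = 0
    for ch in token:
        h = (h * 31 + ord(ch)) % NUM_BUCKETS
    return h


def lexical_features(url):
    """Sparse lexical features: normalized bigram counts plus three extras."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.netloc.lower()
    text = host + parsed.path.lower()
    n = max(len(text) - 1, 1)
    feats = {}
    for i in range(len(text) - 1):
        idx = bucket(text[i:i + 2])
        feats[idx] = feats.get(idx, 0.0) + 1.0 / n
    # Assumed extra features, stored past the hashed range.
    feats[NUM_BUCKETS] = len(url) / 100.0          # scaled URL length
    feats[NUM_BUCKETS + 1] = host.count(".") / 10.0  # scaled dot count in host
    feats[NUM_BUCKETS + 2] = 1.0                   # bias term
    return feats


class OnlineLogisticRegression:
    """Logistic regression trained one example at a time with SGD."""

    def __init__(self, dim, learning_rate=0.5):
        self.w = [0.0] * dim
        self.lr = learning_rate

    def predict_proba(self, feats):
        z = sum(self.w[i] * v for i, v in feats.items())
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, feats, label):
        # Gradient step on the log-likelihood for a single example.
        err = label - self.predict_proba(feats)
        for i, v in feats.items():
            self.w[i] += self.lr * err * v


def train(model, data, epochs=200):
    for _ in range(epochs):
        for url, label in data:
            model.update(lexical_features(url), label)


# Made-up toy data: label 1 = phishing, 0 = benign.
TOY_DATA = [
    ("http://paypal.com.secure-login.account-verify.example.net/update/confirm.php", 1),
    ("http://signin.ebay.com.session-check.example.biz/webscr/login.html", 1),
    ("https://www.example.org/docs", 0),
    ("https://news.example.com/today", 0),
]

model = OnlineLogisticRegression(NUM_BUCKETS + 3)
train(model, TOY_DATA)
```

Because each update touches only the features present in one URL, the model can be trained on a stream without holding the data set in memory, which is the usual motivation for preferring online over batch learning at scale.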
Keywords :
Web services; computer crime; learning (artificial intelligence); logistics; regression analysis; Mahout; URL phishing; World Wide Web; batch learning; bigrams; email based spam filtering; gaming; host-based features; information gain attribute evaluation methods; modest false positive rate; online banking; online learning; online logistic regression; raw data; rule generation; scalable machine learning problems; social networking; Accuracy; Classification algorithms; Feature extraction; IP networks; Support vector machine classification; Training; Uniform resource locators; Attribute Evaluation; Decision Tree; Feature Vector; Rule Generation; Stochastic Gradient Descent
Conference_Title :
Big Data (Big Data), 2014 IEEE International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/BigData.2014.7004239