A Practical Guide for Detecting the Java Script-Based Malware Using Hidden Markov Models and Linear Classifiers

Author

Cosovan, Doina ; Benchea, Razvan ; Gavrilut, Dragos

Author_Institution

Romania Bitdefender Anti-virus Res. Lab., Al.I. Cuza Univ. of Iasi, Iasi, Romania

fYear

2014

fDate

22-25 Sept. 2014

Firstpage

236

Lastpage

243

Abstract

The World Wide Web has evolved so rapidly that it is no longer considered a luxury, but a necessity. As a result, the most popular infection vectors currently used by cyber criminals are either web pages or commonly used documents (such as PDF files). In both cases, the malicious actions are written in JavaScript, which has therefore become the preferred language for spreading malware. To stop malicious content from executing, detecting its infection vector is crucial. In this paper we propose various methods for detecting JavaScript-based attack vectors. To achieve this goal, we first need to counter the metamorphism techniques commonly used in malicious JavaScript code, which are by no means trivial: garbage instruction insertion, variable renaming, equivalent instruction substitution, function permutation, instruction reordering, and so on. Our approach to dealing with metamorphism starts by splitting the JavaScript content into components and filtering out the insignificant ones. We then use a data set consisting of over one million JavaScript files to test several machine learning algorithms for malware detection, such as Hidden Markov Models, linear classifiers, and hybrid approaches. Finally, we analyze these detection methods from a practical point of view, emphasizing the need for a very low false positive rate and the ability to be trained on large datasets.
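
The idea of splitting a script into components and discarding insignificant detail can be illustrated with a minimal sketch. This is not the authors' implementation: the token pattern, the keyword list, and the names `normalize` and `bigram_counts` are assumptions chosen for illustration. The sketch maps every identifier, string, and number to a coarse class, so that renaming a variable does not change the resulting sequence, and then counts adjacent pairs, which could serve as features for a linear classifier or as transition counts for a Markov model.

```python
import re
from collections import Counter

# Hypothetical token pattern: identifiers, numbers, quoted strings, punctuation.
TOKEN_RE = re.compile(
    r"[A-Za-z_$][\w$]*"          # identifier or keyword
    r"|\d+(?:\.\d+)?"            # integer or decimal literal
    r"|\"(?:\\.|[^\"\\])*\""     # double-quoted string
    r"|'(?:\\.|[^'\\])*'"        # single-quoted string
    r"|[^\s\w]"                  # any single punctuation character
)

# Illustrative subset of words kept verbatim (an assumption, not from the paper).
KEYWORDS = {"var", "function", "return", "if", "else", "for", "while",
            "new", "eval", "unescape", "document", "window"}


def normalize(source):
    """Map JavaScript text to a sequence of coarse token classes."""
    classes = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            classes.append(tok)      # keep selected words as they are
        elif tok[0] in "\"'":
            classes.append("STR")    # string content is discarded
        elif tok[0].isdigit():
            classes.append("NUM")    # numeric value is discarded
        elif tok[0].isalpha() or tok[0] in "_$":
            classes.append("ID")     # variable names are discarded
        else:
            classes.append(tok)      # punctuation is kept
    return classes


def bigram_counts(classes):
    """Count adjacent pairs of token classes."""
    return Counter(zip(classes, classes[1:]))
```

With this sketch, `normalize("var a = 1;")` and `normalize("var zz9 = 42;")` yield the same sequence, which is the sense in which variable renaming is neutralized. Comments and other JavaScript syntax (regular expression literals, template strings) are not handled here.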

Keywords

Java; Web sites; hidden Markov models; invasive software; learning (artificial intelligence); pattern classification; vectors; JavaScript content; JavaScript files; JavaScript malicious code; JavaScript-Based malware detection; JavaScript-based attack vector detection; Web pages; World Wide Web; cybercriminals; equivalent instruction substitution; function permutation; garbage instruction insertion; hidden Markov models; infection vectors; instruction reordering; linear classifiers; machine learning algorithms; metamorphism techniques; variable renaming; Feature extraction; HTML; Hidden Markov models; Malware; Portable document format; Reactive power; Vectors; Hidden Markov Model; Java Script; Linear Classifier; Machine Learning; PDF; detection; infection vector; malware; metamorphism;

fLanguage

English

Publisher

IEEE

Conference_Titel

Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2014 16th International Symposium on

Conference_Location

Timisoara

Print_ISBN

978-1-4799-8447-3

Type

conf

DOI

10.1109/SYNASC.2014.39

Filename

7034689