Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Nat. Taiwan Univ. of Sci. & Technol., Taipei, Taiwan
Abstract :
Malicious documents are among the most notorious components of modern attacks. Such a document may appear normal in its format, yet behave strangely or beyond users' expectations when it is opened, sometimes with severe consequences. Detecting malicious documents is therefore one of the most important tasks in modern information security. Malicious documents usually contain specific control codes that can cause malicious shell code to be executed automatically. Document control codes were originally designed to enrich a document's functionality, but in this case they can create vulnerabilities and become the key to triggering attacks. Detecting control codes that follow certain patterns is thus key to successful malicious document detection. Unlike previous research, which focused on detecting malicious documents of a particular format or containing specific control codes, we propose a method that analyzes document objects from three general views: the use of functional words, preference words, and constant data. In a malicious document, the functional words control how an attack is launched and through what actions; the preference words reflect the word choices favored by document authors; and the constant data can be considered the bullets that complete the attack. We also propose a TF-IDF method that normalizes the features in order to detect documents that use mimicry attacks. Given the three feature views, we detect malicious documents under a classification framework. We evaluate the proposed approach through a series of experiments that use different view combinations for prediction, followed by a comparison of the proposed method with related work.
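The abstract does not give the exact weighting scheme, so the following is only a minimal sketch of textbook TF-IDF (term frequency times log inverse document frequency), not the authors' formulation. The function name `tf_idf` and the token lists are hypothetical. The intuition it illustrates is that tokens appearing in nearly every document receive weights near zero, so padding a malicious document with common benign content changes the normalized features less than it changes raw counts.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {token: weight} dict per document.

    docs is a list of token lists. Weighting is the textbook form
    tf * log(N / df); this is an assumption, not the paper's formula.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each token appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = sum(counts.values())
        weights.append({
            tok: (cnt / total) * math.log(n_docs / df[tok])
            for tok, cnt in counts.items()
        })
    return weights


# Hypothetical token lists standing in for words extracted from document objects.
docs = [
    ["/JS", "/OpenAction", "/JS"],
    ["/Page", "/JS"],
    ["/Page", "/Font"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this sketch a token that occurs in every document gets weight 0, because log(N / N) = 0, while a token confined to a single document keeps a comparatively high weight.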
Keywords :
document handling; pattern classification; security of data; TF-IDF method; classification framework; constant data; document control code; functional words; information security; malicious shell code; mimicry attacks; multiview malicious document detection; preference words; Accuracy; Data mining; Detectors; Entropy; Feature extraction; Information security; Portable document format; PDF; exploit; malicious document; multi-view; vulnerability;