Title :
Modeling High-Level Behavior Patterns for Precise Similarity Analysis of Software
Author :
Kwon, Taeho ; Su, Zhendong
Author_Institution :
Dept. of Comput. Sci., Univ. of California, Davis, CA, USA
Abstract :
Software similarity analysis has many applications, such as detecting code clones, software plagiarism, code theft, and polymorphic malware. Because source code is often unavailable and code obfuscation is used to avoid detection, much research has focused on developing effective models that capture runtime behavior to aid detection. Existing models focus on low-level information, such as dependencies or the mere occurrence of function calls, and suffer from poor precision, poor scalability, or both. To overcome these limitations, this paper introduces a precise and succinct behavior representation that characterizes high-level object-accessing patterns as regular expressions. We first distill a set of high-level patterns (the alphabet Σ of the regular language) based on two pieces of information: the function call patterns used to access objects and the type state information of those objects. Then we abstract a runtime trace of a program P into a regular expression e over the pattern alphabet Σ to produce P's behavior signature. We show that software instances derived from the same code exhibit similar behavior signatures, and we develop effective algorithms to cluster and match behavior signatures. To evaluate the effectiveness of our behavior model, we have applied it to the similarity analysis of polymorphic malware. Our results on a large malware collection demonstrate that our model is both precise and succinct, enabling effective and scalable matching and detection of polymorphic malware.
Keywords :
industrial property; invasive software; pattern matching; program diagnostics; behavior signature; code clone detection; code obfuscation; function call pattern; high-level object-accessing pattern modeling; polymorphic malware; poor precision; poor scalability; precise behavior representation; program runtime tracing; runtime behavior; software plagiarism; software precise similarity analysis; source code theft; succinct behavior representation; Algorithm design and analysis; Analytical models; Clustering algorithms; Malware; Measurement; Software; Software algorithms; malware analysis and clustering; sequence clustering; software behavior model;
Conference_Title :
Data Mining (ICDM), 2011 IEEE 11th International Conference on
Conference_Location :
Vancouver, BC
Print_ISBN :
978-1-4577-2075-8
DOI :
10.1109/ICDM.2011.104