• DocumentCode
    3126421
  • Title

    Modeling High-Level Behavior Patterns for Precise Similarity Analysis of Software

  • Author

    Kwon, Taeho ; Su, Zhendong

  • Author_Institution
    Dept. of Comput. Sci., Univ. of California, Davis, CA, USA
  • fYear
    2011
  • fDate
    11-14 Dec. 2011
  • Firstpage
    1134
  • Lastpage
    1139
  • Abstract
    The analysis of software similarity has many applications, such as detecting code clones, software plagiarism, code theft, and polymorphic malware. Because source code is often unavailable and code obfuscation is used to avoid detection, there has been much research on developing effective models that capture runtime behavior to aid detection. Existing models focus on low-level information, such as dependencies or the mere occurrence of function calls, and suffer from poor precision, poor scalability, or both. To overcome the limitations of existing models, this paper introduces a precise and succinct behavior representation that characterizes high-level object-accessing patterns as regular expressions. We first distill a set of high-level patterns (the alphabet S of the regular language) based on two pieces of information: function call patterns to access objects and type state information of the objects. Then we abstract a runtime trace of a program P into a regular expression e over the pattern alphabet S to produce P's behavior signature. We show that software instances derived from the same code exhibit similar behavior signatures, and we develop effective algorithms to cluster and match behavior signatures. To evaluate the effectiveness of our behavior model, we have applied it to the similarity analysis of polymorphic malware. Our results on a large malware collection demonstrate that our model is both precise and succinct for effective and scalable matching and detection of polymorphic malware.
  • Keywords
    industrial property; invasive software; pattern matching; program diagnostics; behavior signature; code clone detection; code obfuscation; function call pattern; high-level object-accessing pattern modeling; polymorphic malware; poor precision; poor scalability; precise behavior representation; program runtime tracing; runtime behavior; software plagiarism; software precise similarity analysis; source code theft; succinct behavior representation; Algorithm design and analysis; Analytical models; Clustering algorithms; Malware; Measurement; Software; Software algorithms; malware analysis and clustering; sequence clustering; software behavior model;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining (ICDM), 2011 IEEE 11th International Conference on
  • Conference_Location
    Vancouver, BC
  • ISSN
    1550-4786
  • Print_ISBN
    978-1-4577-2075-8
  • Type

    conf

  • DOI
    10.1109/ICDM.2011.104
  • Filename
    6137327
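
  • Illustration
    The abstract describes abstracting a runtime trace into a regular expression over a pattern alphabet and then matching the resulting signatures. The sketch below is only a rough illustration of that idea and is not the authors' algorithm: it assumes a trace is already a list of pattern symbols, collapses each run of a repeated symbol into `symbol+`, and compares two signatures with Python's standard `difflib.SequenceMatcher`. The symbol names (`open`, `read`, `write`, `close`) and both function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import groupby


def trace_to_signature(trace):
    """Collapse each run of a repeated pattern symbol into `symbol+`.

    Assumption: `trace` is a list of already-extracted pattern symbols.
    This run-collapsing step is a stand-in for the paper's abstraction,
    not a reproduction of it.
    """
    parts = []
    for symbol, run in groupby(trace):
        count = sum(1 for _ in run)
        parts.append(symbol + "+" if count > 1 else symbol)
    return parts


def signature_similarity(sig_a, sig_b):
    """Return a similarity score in [0, 1] between two signatures."""
    return SequenceMatcher(None, sig_a, sig_b).ratio()


# Hypothetical traces: two variants that differ only in loop counts,
# and one that performs a different operation on the object.
trace_a = ["open", "read", "read", "read", "close"]
trace_b = ["open", "read", "read", "close"]
trace_c = ["open", "write", "close"]

sig_a = trace_to_signature(trace_a)
sig_b = trace_to_signature(trace_b)
sig_c = trace_to_signature(trace_c)

print(" ".join(sig_a))                     # open read+ close
print(signature_similarity(sig_a, sig_b))  # 1.0
print(signature_similarity(sig_a, sig_c))
```

    In this toy setting, the two variants that differ only in how often a call repeats map to the same signature, while the third scores lower. A real system would still need the pattern distillation and clustering steps that the abstract mentions.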