  • DocumentCode
    492590
  • Title
    Scalable detection of semantic clones
  • Author
    Gabel, Mark; Jiang, Lingxiao; Su, Zhendong
  • Author_Institution
    Univ. of California, Davis, Davis, CA
  • fYear
    2008
  • fDate
    10-18 May 2008
  • Firstpage
    321
  • Lastpage
    330
  • Abstract
    Several techniques have been developed for identifying similar code fragments in programs. These similar fragments, referred to as code clones, can be used to identify redundant code, locate bugs, or gain insight into program design. Existing scalable approaches to clone detection are limited to finding program fragments that are similar only in their contiguous syntax. Other, semantics-based approaches are more resilient to differences in syntax, such as reordered statements, related statements interleaved with other unrelated statements, or the use of semantically equivalent control structures. However, none of these techniques has scaled to real-world code bases. These approaches capture semantic information from Program Dependence Graphs (PDGs), program representations that encode data and control dependencies between statements and predicates. Our definition of a code clone is also based on this representation: we consider program fragments with isomorphic PDGs to be clones. In this paper, we present the first scalable clone detection algorithm based on this definition of semantic clones. Our insight is the reduction of the difficult graph similarity problem to a simpler tree similarity problem by mapping carefully selected PDG subgraphs to their related structured syntax. We efficiently solve the tree similarity problem to create a scalable analysis. We have implemented this algorithm in a practical tool and performed evaluations on several million-line open source projects, including the Linux kernel. Compared with previous approaches, our tool locates significantly more clones, which are often more semantically interesting than simple copied and pasted code fragments.
  • Keywords
    Linux; data flow graphs; program compilers; program control structures; program debugging; trees (mathematics); Linux kernel; PDG subgraphs; bugs location; code clones; contiguous syntax; graph similarity problem; isomorphic PDG; million-line open source projects; program code fragments; program dependence graphs; program design; program fragments; program representations; real world code bases; redundant code; reordered statements; scalable clone detection algorithm; scalable detection; semantic clones; semantic information; semantically equivalent control structures; structured syntax; tree similarity problem; Cloning; Computer bugs; Computer science; Detection algorithms; Kernel; Linux; Performance evaluation; Software algorithms; Software maintenance; Tree graphs; clone detection; program dependence graph; refactoring; software maintenance
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Software Engineering, 2008. ICSE '08. ACM/IEEE 30th International Conference on
  • Conference_Location
    Leipzig
  • ISSN
    0270-5257
  • Print_ISBN
    978-1-4244-4486-1
  • Electronic_ISBN
    0270-5257
  • Type
    conf
  • DOI
    10.1145/1368088.1368132
  • Filename
    4814143