DocumentCode
2744883
Title
DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones
Author
Jiang, Lingxiao ; Misherghi, Ghassan ; Su, Zhendong ; Glondu, Stéphane
Author_Institution
Univ. of California, Davis, CA
fYear
2007
fDate
20-26 May 2007
Firstpage
96
Lastpage
105
Abstract
Detecting code clones has many software engineering applications. Existing approaches either do not scale to large code bases or are not robust against minor code modifications. In this paper, we present an efficient algorithm for identifying similar subtrees and apply it to tree representations of source code. Our algorithm is based on a novel characterization of subtrees with numerical vectors in the Euclidean space R^n and an efficient algorithm to cluster these vectors w.r.t. the Euclidean distance metric. Subtrees with vectors in one cluster are considered similar. We have implemented our tree similarity algorithm as a clone detection tool called DECKARD and evaluated it on large code bases written in C and Java including the Linux kernel and JDK. Our experiments show that DECKARD is both scalable and accurate. It is also language independent, applicable to any language with a formally specified grammar.
Keywords
software engineering; trees (mathematics); Deckard; Euclidean distance metric; code clones; source code; subtrees; tree representations; tree-based detection; Application software; Cloning; Clustering algorithms; Euclidean distance; Fingerprint recognition; Java; Linux; Programming profession; Robustness
fLanguage
English
Publisher
IEEE
Conference_Titel
Software Engineering, 2007. ICSE 2007. 29th International Conference on
Conference_Location
Minneapolis, MN
ISSN
0270-5257
Print_ISBN
0-7695-2828-7
Type
conf
DOI
10.1109/ICSE.2007.30
Filename
4222572