DocumentCode
2453215
Title
Ctcompare: Code clone detection using hashed token sequences
Author
Toomey, Warren
Author_Institution
Sch. of IT, Bond Univ., Robina, QLD, Australia
fYear
2012
fDate
4 June 2012
Firstpage
92
Lastpage
93
Abstract
There is much research on the use of tokenized source code to find code clones both within and between trees of source code. Some approaches have used suffix trees [1], [3]; others have used variations of longest common substring algorithms [4], [5]. This paper outlines an algorithm, embodied in a new tool called ctcompare, that takes a different tokenization approach. Each code base to be compared is first lexically analysed to produce a sequence of tokens. These are then broken into overlapping tuples of N consecutive tokens. The tuples are then hashed and the hash values of token tuples are used to identify type-1 and type-2 clone pairs. Hashed token sequences combined with a database have already been used in earlier ctcompare versions and elsewhere [2], but with a significant performance penalty due to database insertions. The benefits of this approach over the existing research include the simultaneous comparison of multiple large code bases and fast absolute performance.
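The pipeline described in the abstract (tokenize, form overlapping tuples of N consecutive tokens, hash each tuple, match equal hashes) can be sketched roughly as follows. This is an illustrative approximation only, not ctcompare's actual implementation: the regex tokenizer, the small keyword set, the function names (`tokenize`, `tuple_hashes`, `find_clones`), the default N of 4, and the use of Python's built-in `hash` are all assumptions made for the example. Replacing identifiers and numeric literals with placeholder tokens is one common way to surface type-2 clones; the abstract does not say how ctcompare does this.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately crude lexer: identifiers, integers, or any
# other single non-whitespace character. A real tool would use a proper
# lexer for the language being analysed.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}


def tokenize(source, normalise=False):
    """Return the token sequence for a piece of source text.

    With normalise=True, identifiers become "ID" and numbers become "NUM",
    so code that differs only in names or literals yields equal tokens.
    """
    tokens = TOKEN_RE.findall(source)
    if not normalise:
        return tokens
    out = []
    for tok in tokens:
        if tok in KEYWORDS:
            out.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")
        elif tok[0].isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return out


def tuple_hashes(tokens, n):
    """Hash every overlapping run of n consecutive tokens.

    Returns (hash value, start position) pairs. The built-in hash() is
    used for brevity; it is only stable within a single process.
    """
    return [(hash(tuple(tokens[i:i + n])), i)
            for i in range(len(tokens) - n + 1)]


def find_clones(sources, n=4, normalise=False):
    """Group tuple locations by hash across any number of code bases.

    sources maps a name to source text. Each returned group lists the
    (name, position) pairs sharing one hash value. Equal hashes are only
    candidates: a careful tool would compare the tokens to rule out
    collisions.
    """
    index = defaultdict(list)
    for name, text in sources.items():
        for h, pos in tuple_hashes(tokenize(text, normalise), n):
            index[h].append((name, pos))
    return [locs for locs in index.values() if len(locs) > 1]


if __name__ == "__main__":
    code = {"a.c": "int total = count + 1;", "b.c": "int sum = n + 2;"}
    print(find_clones(code, normalise=False))  # exact (type-1) matching
    print(find_clones(code, normalise=True))   # renamed (type-2) matching
```

Because all code bases feed one in-memory index, any number of trees can be compared in a single pass, which loosely mirrors the simultaneous comparison the abstract mentions.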
Keywords
cryptography; source coding; trees (mathematics); code clone detection; ctcompare; hashed token sequences; suffix trees; tokenized source code; Algorithm design and analysis; Australia; Cloning; Databases; Educational institutions; Redundancy; Time measurement; clone detection; code clone; code redundancy; hash function; software
fLanguage
English
Publisher
IEEE
Conference_Titel
Software Clones (IWSC), 2012 6th International Workshop on
Conference_Location
Zurich
Print_ISBN
978-1-4673-1794-8
Type
conf
DOI
10.1109/IWSC.2012.6227881
Filename
6227881