Ctcompare: Code clone detection using hashed token sequences

Author

Toomey, Warren

Author_Institution

Sch. of IT, Bond Univ., Robina, QLD, Australia

fYear

2012

fDate

4 June 2012

Firstpage

92

Lastpage

93

Abstract

There is much research on the use of tokenized source code to find code clones both within and between trees of source code. Some approaches have used suffix trees [1], [3]; others have used variations of longest common substring algorithms [4], [5]. This paper outlines an algorithm, embodied in a new tool called ctcompare, that takes a different tokenization approach. Each code base to be compared is first lexically analysed to produce a sequence of tokens. These are then broken into overlapping tuples of N consecutive tokens. The tuples are then hashed and the hash values of token tuples are used to identify type-1 and type-2 clone pairs. Hashed token sequences combined with a database have already been used in earlier ctcompare versions and elsewhere [2], but with a significant performance penalty due to database insertions. The benefits of this approach over the existing research include the simultaneous comparison of multiple large code bases and fast absolute performance.
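
The pipeline described in the abstract (tokenize, form overlapping N-token tuples, hash each tuple, match equal hashes) can be sketched roughly as follows. This is an illustrative sketch only, not ctcompare's actual implementation. The regex lexer, the SHA-1 hash, the default N = 4, the crude identifier/number normalization used to approximate type-2 matching, and all function names are assumptions made for this example.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical toy lexer: identifiers, integer literals, single punctuation chars.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source, normalize=False):
    """Split source text into tokens.

    With normalize=True, every identifier-like token becomes "ID" and every
    number becomes "NUM" (a crude stand-in for type-2 matching; it also
    collapses keywords, which a real lexer would keep distinct).
    """
    tokens = TOKEN_RE.findall(source)
    if not normalize:
        return tokens
    out = []
    for tok in tokens:
        if tok[0].isalpha() or tok[0] == "_":
            out.append("ID")
        elif tok[0].isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return out


def tuple_hashes(tokens, n):
    """Yield (position, hash) for each overlapping window of n tokens."""
    for i in range(len(tokens) - n + 1):
        window = "\x00".join(tokens[i:i + n])
        yield i, hashlib.sha1(window.encode("utf-8")).hexdigest()


def find_candidate_pairs(sources, n=4, normalize=False):
    """Return pairs of (name, position) whose n-token tuples hash equally.

    `sources` maps a code-base name to its text. All code bases share one
    in-memory index, so several can be compared in a single pass.
    """
    index = defaultdict(list)
    for name, text in sources.items():
        for pos, digest in tuple_hashes(tokenize(text, normalize), n):
            index[digest].append((name, pos))

    pairs = []
    for hits in index.values():
        for i in range(len(hits)):
            for j in range(i + 1, len(hits)):
                if hits[i][0] != hits[j][0]:  # report cross-code-base matches only
                    pairs.append((hits[i], hits[j]))
    return pairs
```

For example, `find_candidate_pairs({"a": "x = y + 1;", "b": "a = b + 2;"}, normalize=True)` reports matching tuples because both statements normalize to the same token sequence, while the same call with `normalize=False` reports none. A fuller tool would also merge runs of adjacent matching tuples into longer clone regions and confirm each match on the underlying tokens to rule out hash collisions.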

Keywords

cryptography; source coding; trees (mathematics); code clone detection; ctcompare; hashed token sequences; suffix trees; tokenized source code; Algorithm design and analysis; Australia; Cloning; Databases; Educational institutions; Redundancy; Time measurement; clone detection; code clone; code redundancy; hash function; software

fLanguage

English

Publisher

IEEE

Conference_Title

Software Clones (IWSC), 2012 6th International Workshop on

Conference_Location

Zurich

Print_ISBN

978-1-4673-1794-8

Type

conf

DOI

10.1109/IWSC.2012.6227881

Filename

6227881