  • DocumentCode
    2453215
  • Title
    Ctcompare: Code clone detection using hashed token sequences
  • Author
    Toomey, Warren
  • Author_Institution
    Sch. of IT, Bond Univ., Robina, QLD, Australia
  • fYear
    2012
  • fDate
    4-4 June 2012
  • Firstpage
    92
  • Lastpage
    93
  • Abstract
    There is much research on the use of tokenized source code to find code clones both within and between trees of source code. Some approaches have used suffix trees [1], [3]; others have used variations of longest common substring algorithms [4], [5]. This paper outlines an algorithm, embodied in a new tool called ctcompare, that takes a different tokenization approach. Each code base to be compared is first lexically analysed to produce a sequence of tokens. These are then broken into overlapping tuples of N consecutive tokens. The tuples are then hashed and the hash values of token tuples are used to identify type-1 and type-2 clone pairs. Hashed token sequences combined with a database have already been used in earlier ctcompare versions and elsewhere [2], but with a significant performance penalty due to database insertions. The benefits of this approach over the existing research include the simultaneous comparison of multiple large code bases and fast absolute performance.
  • Keywords
    cryptography; source coding; trees (mathematics); code clone detection; ctcompare; hashed token sequences; suffix trees; tokenized source code; Algorithm design and analysis; Australia; Cloning; Databases; Educational institutions; Redundancy; Time measurement; clone detection; code clone; code redundancy; hash function; software;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Software Clones (IWSC), 2012 6th International Workshop on
  • Conference_Location
    Zurich
  • Print_ISBN
    978-1-4673-1794-8
  • Type
    conf
  • DOI
    10.1109/IWSC.2012.6227881
  • Filename
    6227881
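
The pipeline summarised in the abstract (lexical analysis, overlapping tuples of N consecutive tokens, hashing, matching on hash values) can be illustrated with a minimal sketch. This is not ctcompare's implementation: the regex lexer, the keyword set, the `ID`/`NUM` placeholders, and the function names `tokenize`, `tuple_hashes`, and `shared_tuples` are all assumptions made for illustration, and Python's built-in `hash` stands in for whatever hash function the tool uses. With `normalize=False` only identical token runs match (roughly type-1 clones); with `normalize=True` identifiers and literals are collapsed first, so renamed copies also match (roughly type-2 clones).

```python
import re
from collections import defaultdict

# Hypothetical toy lexer: identifiers, integer literals, or any single
# non-space character. A real tool would use a proper per-language lexer.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}


def tokenize(source, normalize=False):
    """Turn source text into a token list.

    With normalize=True, identifiers become "ID" and numbers become "NUM",
    so code that differs only in names or literal values tokenizes alike.
    """
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if normalize and tok not in KEYWORDS:
            if tok[0].isalpha() or tok[0] == "_":
                tok = "ID"
            elif tok[0].isdigit():
                tok = "NUM"
        tokens.append(tok)
    return tokens


def tuple_hashes(tokens, n):
    """Hash every overlapping run of n consecutive tokens."""
    return [hash(tuple(tokens[i:i + n])) for i in range(len(tokens) - n + 1)]


def shared_tuples(sources, n=4, normalize=False):
    """Return groups of (name, position) whose n-token tuples hash alike
    and that span more than one source. `sources` maps a name to its text.
    """
    index = defaultdict(list)  # hash value -> [(source name, token offset)]
    for name, text in sources.items():
        hashes = tuple_hashes(tokenize(text, normalize), n)
        for pos, h in enumerate(hashes):
            index[h].append((name, pos))
    return [locs for locs in index.values()
            if len({name for name, _ in locs}) > 1]


if __name__ == "__main__":
    code = {"a.c": "x = a + b;", "b.c": "y = c + d;"}
    print(shared_tuples(code, n=4, normalize=False))
    print(shared_tuples(code, n=4, normalize=True))
```

Because the index is keyed by hash value, any number of code bases can be added to the same dictionary and compared in one pass. Equal hashes do not strictly guarantee equal tuples, so a careful implementation would confirm each candidate match by comparing the underlying tokens, and would merge adjacent matching tuples into longer clone regions.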