• DocumentCode
    2155700
  • Title

    Syntax tree fingerprinting for source code similarity detection

  • Author

    Chilowicz, Michel ; Duris, Etienne ; Roussel, Gilles

  • Author_Institution
    Lab. d'Inf., Univ. Paris-Est, Marne-la-Vallée
  • fYear
    2009
  • fDate
    17-19 May 2009
  • Firstpage
    243
  • Lastpage
    247
  • Abstract
    Numerous approaches based on metrics, token sequence pattern-matching, abstract syntax tree (AST) or program dependency graph (PDG) analysis have already been proposed to highlight similarities in source code. In this paper we present a simple and scalable architecture based on AST fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, quickly detects exact (w.r.t. source code abstraction) clone clusters, and easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify larger approximate matches, dealing with the common modification patterns seen in intra-project copy-pastes and in plagiarism cases.
  • Keywords
    cryptography; database indexing; pattern matching; program diagnostics; software metrics; tree data structures; abstract syntax tree fingerprinting; clone cluster; false-positive collision; hash strategy; intra-project copy-paste; plagiarism; program dependency graph analysis; source code abstraction; source code similarity detection; token sequence pattern-matching; Cloning; Databases; Fingerprint recognition; Indexes; Information retrieval; Pattern analysis; Pattern matching; Plagiarism; Scalability; Software maintenance
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Program Comprehension, 2009. ICPC '09. IEEE 17th International Conference on
  • Conference_Location
    Vancouver, BC
  • ISSN
    1092-8138
  • Print_ISBN
    978-1-4244-3998-0
  • Electronic_ISBN
    1092-8138
  • Type

    conf

  • DOI
    10.1109/ICPC.2009.5090050
  • Filename
    5090050