Title :
Large-Scale Inter-System Clone Detection Using Suffix Trees
Author_Institution :
Univ. of Bremen, Bremen, Germany
Abstract :
Detecting license violations in source code requires comparing a suspected system against a very large corpus of source code, for instance, the Debian source distribution. Thus, techniques for detecting suspiciously similar code must scale in terms of the resources needed. In addition, high detection precision is necessary because a human needs to inspect the results. The current approach to addressing the resource challenge is to create an index for the corpus, against which the suspected source code is compared. The index creation, however, is very costly. If the analysis is done only once, it may not be worth the effort. This paper demonstrates how suffix trees can be used to obtain a scalable comparison. Our evaluation shows that this approach is faster than current index-based techniques. In addition, this paper proposes a method to improve precision through user feedback and automated data mining.
Keywords :
data mining; law; software maintenance; Debian source distribution; automated data mining; index creation; index-based techniques; large-scale inter-system clone detection; license violations; source code; suffix trees; suspiciously similar code; user feedback; Arrays; Cloning; Detectors; Indexes; Licenses; Search problems; Software; clone detection; code search; license violation detection
Conference_Title :
Software Maintenance and Reengineering (CSMR), 2012 16th European Conference on
Conference_Location :
Szeged
Print_ISBN :
978-1-4673-0984-4
DOI :
10.1109/CSMR.2012.37