DocumentCode :
3406920
Title :
Mining source code repositories at massive scale using language modeling
Author :
Allamanis, Miltiadis ; Sutton, Craig
Author_Institution :
Sch. of Inf., Univ. of Edinburgh, Edinburgh, UK
fYear :
2013
fDate :
18-19 May 2013
Firstpage :
207
Lastpage :
216
Abstract :
The tens of thousands of high-quality open source software projects on the Internet raise the exciting possibility of studying software development by finding patterns across truly large source code repositories. This could enable new tools for developing code, encouraging reuse, and navigating large projects. In this paper, we build the first giga-token probabilistic language model of source code, based on 352 million lines of Java. This is 100 times the scale of the pioneering work by Hindle et al. The giga-token model is significantly better at the code suggestion task than previous models. More broadly, our approach provides a new “lens” for analyzing software projects, enabling new complexity metrics based on statistical analysis of large corpora. We call these metrics data-driven complexity metrics. We propose new metrics that measure the complexity of a code module and the topical centrality of a module to a software project. In particular, it is possible to distinguish reusable utility classes from classes that are part of a program's core logic based solely on general information theoretic criteria.
Keywords :
Java; data mining; project management; software management; software metrics; source coding; statistical analysis; code module complexity; code suggestion task; data-driven complexity metrics; general information theoretic criteria; giga-token probabilistic language model; module topical centrality; programs core logic; reusable utility classes; software project analysis; source code repositories mining; Complexity theory; Entropy; Measurement; Predictive models; Software; Training
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Mining Software Repositories (MSR), 2013 10th IEEE Working Conference on
Conference_Location :
San Francisco, CA
ISSN :
2160-1852
Print_ISBN :
978-1-4799-0345-0
Type :
conf
DOI :
10.1109/MSR.2013.6624029
Filename :
6624029