Title of article :
Using heuristics to estimate an appropriate number of latent topics in source code analysis
Author/Authors :
Scott Grant, James R. Cordy, David B. Skillicorn
Issue Information :
Monthly, serially numbered issue, 2013
Pages :
16
From page :
1663
To page :
1678
Abstract :
Latent Dirichlet Allocation (LDA) is a data clustering algorithm that performs especially well for text documents. In natural-language applications it automatically finds groups of related words (called “latent topics”) and clusters the documents into sets that are about the same “topic”. LDA has also been applied to source code, where the documents are natural source code units such as methods or classes, and the words are the keywords, operators, and programmer-defined names in the code. Determining the topic count that most appropriately describes a set of source code documents is an open problem. We address this empirically by constructing clusterings with different numbers of topics for a large number of software systems and then using a pair of measures, based on source code locality and topic model similarity, to assess how well the topic structure identifies related source code units. Results suggest that the topic count required can be closely approximated using the number of software code fragments in the system. We extend these results to recommend appropriate topic counts for arbitrary software systems based on an analysis of a set of open source systems.
Keywords :
Source code analysis, Latent Dirichlet Allocation, Code clusters, Latent topic model
Journal title :
Science of Computer Programming
Serial Year :
2013
Record number :
1080405