• DocumentCode
    2569423
  • Title
    Estimating the Optimal Number of Latent Concepts in Source Code Analysis
  • Author
    Grant, Scott; Cordy, James R.
  • Author_Institution
    Sch. of Comput., Queen's Univ., Kingston, ON, Canada
  • fYear
    2010
  • fDate
    12-13 Sept. 2010
  • Firstpage
    65
  • Lastpage
    74
  • Abstract
    The optimal number of latent topics required to model the most accurate latent substructure for a source code corpus is an open question in source code analysis. Most estimates about the number of latent topics that exist in a software corpus are based on the assumption that the data is similar to natural language, but there is little empirical evidence to support this. In order to help determine the appropriate number of topics needed to accurately represent the source code, we generate a series of Latent Dirichlet Allocation models with varying topic counts. We use a heuristic to evaluate the ability of the model to identify related source code blocks, and demonstrate the consequences of choosing too few or too many latent topics.
  • Keywords
    program diagnostics; statistical analysis; latent Dirichlet allocation models; latent concepts; latent substructure; latent topics; software corpus; source code analysis; source code blocks; source code corpus; Biological system modeling; Cloning; Data models; Information retrieval; Measurement; Natural languages; Semantics; concept location; latent Dirichlet allocation; latent topic model
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Source Code Analysis and Manipulation (SCAM), 2010 10th IEEE Working Conference on
  • Conference_Location
    Timisoara
  • Print_ISBN
    978-1-4244-8655-7
  • Type
    conf
  • DOI
    10.1109/SCAM.2010.22
  • Filename
    5601828