• DocumentCode
    3112382
  • Title
    Building Statistical Language Models of Code
  • Author
    Schulam, Peter ; Rosenfeld, Roni ; Devanbu, Premkumar
  • Author_Institution
    Language Technol. Inst., Carnegie Mellon Univ., Pittsburgh, PA, USA
  • fYear
    2013
  • fDate
    21 May 2013
  • Firstpage
    1
  • Lastpage
    3
  • Abstract
    We present the Source Code Statistical Language Model data analysis pattern. Statistical language models have been an enabling tool for a wide array of important language technologies. Speech recognition, machine translation, and document summarization (to name a few) all rely on statistical language models to assign probability estimates to natural language utterances or sentences. In this data analysis pattern, we describe the process of building n-gram language models over software source files. We hope that by introducing the empirical software engineering community to best practices that have been established over the years in research for natural languages, statistical language models can become a tool that SE researchers are able to use to explore new research directions.
  • Keywords
    data analysis; natural languages; software engineering; source coding; statistical analysis; building statistical language models; document summarization; empirical software engineering community; machine translation; n-gram language models; natural language sentences; natural language utterances; software source files; source code data analysis pattern; speech recognition; Buildings; Data models; Natural languages; Smoothing methods; Software engineering; Speech recognition; Vocabulary;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Analysis Patterns in Software Engineering (DAPSE), 2013 1st International Workshop on
  • Conference_Location
    San Francisco, CA
  • Type
    conf
  • DOI
    10.1109/DAPSE.2013.6603797
  • Filename
    6603797