DocumentCode
3112382
Title
Building Statistical Language Models of Code
Author
Schulam, Peter ; Rosenfeld, Roni ; Devanbu, Premkumar
Author_Institution
Language Technol. Inst., Carnegie Mellon Univ., Pittsburgh, PA, USA
fYear
2013
fDate
21 May 2013
Firstpage
1
Lastpage
3
Abstract
We present the Source Code Statistical Language Model data analysis pattern. Statistical language models have been an enabling tool for a wide array of important language technologies. Speech recognition, machine translation, and document summarization (to name a few) all rely on statistical language models to assign probability estimates to natural language utterances or sentences. In this data analysis pattern, we describe the process of building n-gram language models over software source files. We hope that by introducing the empirical software engineering community to best practices that have been established over the years in research for natural languages, statistical language models can become a tool that SE researchers are able to use to explore new research directions.
Keywords
data analysis; natural languages; software engineering; source coding; statistical analysis; building statistical language models; document summarization; empirical software engineering community; machine translation; n-gram language models; natural language sentences; natural language utterances; software source files; source code data analysis pattern; speech recognition; Buildings; Data models; Natural languages; Smoothing methods; Software engineering; Speech recognition; Vocabulary;
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Analysis Patterns in Software Engineering (DAPSE), 2013 1st International Workshop on
Conference_Location
San Francisco, CA
Type
conf
DOI
10.1109/DAPSE.2013.6603797
Filename
6603797