DocumentCode
731530
Title
StORMeD: Stack Overflow Ready Made Data
Author
Ponzanelli, Luca ; Mocci, Andrea ; Lanza, Michele
Author_Institution
REVEAL @ Fac. of Inf., Univ. of Lugano, Lugano, Switzerland
fYear
2015
fDate
16-17 May 2015
Firstpage
474
Lastpage
477
Abstract
Stack Overflow is the de facto Question and Answer (Q&A) website for developers, and it has been used in many approaches by software engineering researchers to mine useful data. However, the contents of a Stack Overflow discussion are inherently heterogeneous, mixing natural language, source code, stack traces and configuration files in XML or JSON format. We constructed a full island grammar capable of modeling the set of 700,000 Stack Overflow discussions about Java, building a heterogeneous abstract syntax tree (H-AST) of each post (question, answer or comment) in a discussion. The resulting dataset models every Stack Overflow discussion, providing a full H-AST for each type of structured fragment (i.e., JSON, XML, Java, stack traces), and complementing this information with a set of basic meta-information like term frequency to enable natural language analyses. Our dataset allows the end-user to perform combined analyses of Stack Overflow by visiting the H-AST of a discussion.
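The idea of visiting an H-AST to combine structural and natural language analyses can be illustrated with a minimal sketch. This is not the StORMeD data model or API: the `Node` class, the node kinds ("discussion", "post", "text", "java", "stacktrace"), and the helper functions are all hypothetical names chosen for illustration, and the tree is built by hand rather than produced by an island parser.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical H-AST node; 'kind' names are assumptions, not StORMeD's."""
    kind: str                      # e.g. "text", "java", "xml", "json", "stacktrace"
    content: str = ""
    children: list["Node"] = field(default_factory=list)


def visit(node):
    """Yield the node and all of its descendants in pre-order."""
    yield node
    for child in node.children:
        yield from visit(child)


def fragment_counts(root):
    """Count leaf fragments by kind, skipping the container nodes."""
    containers = {"discussion", "post"}
    return Counter(n.kind for n in visit(root) if n.kind not in containers)


def term_frequency(root):
    """Term frequency over natural language fragments only."""
    tf = Counter()
    for n in visit(root):
        if n.kind == "text":
            tf.update(re.findall(r"[a-z]+", n.content.lower()))
    return tf


# A hand-built toy discussion: one question and one answer.
discussion = Node("discussion", children=[
    Node("post", children=[
        Node("text", "Why does this code throw an exception"),
        Node("java", "String s = null; s.length();"),
        Node("stacktrace", "java.lang.NullPointerException at Main.main(Main.java:3)"),
    ]),
    Node("post", children=[
        Node("text", "The code dereferences null"),
    ]),
])

if __name__ == "__main__":
    print(fragment_counts(discussion))
    print(term_frequency(discussion).most_common(3))
```

Keeping every fragment type in one tree means a single traversal can answer both structural questions (how many stack traces does a discussion contain?) and textual ones (which terms dominate the prose?), which is the kind of combined analysis the abstract describes.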
Keywords
Java; Web sites; XML; question answering (information retrieval); software engineering; JSON format; StORMeD; configuration files; heterogeneous abstract syntax tree; natural language; question and answer Website; software engineering researchers; source code; stack overflow ready made data; stack traces; term frequency; Data mining; Data models; Grammar; Natural languages; Software; h-ast; island parsing; unstructured data
fLanguage
English
Publisher
IEEE
Conference_Titel
Mining Software Repositories (MSR), 2015 IEEE/ACM 12th Working Conference on
Conference_Location
Florence
Type
conf
DOI
10.1109/MSR.2015.67
Filename
7180121