Title :
StORMeD: Stack Overflow Ready Made Data
Author :
Ponzanelli, Luca ; Mocci, Andrea ; Lanza, Michele
Author_Institution :
REVEAL @ Fac. of Inf., Univ. of Lugano, Lugano, Switzerland
Abstract :
Stack Overflow is the de facto Question and Answer (Q&A) website for developers, and software engineering researchers have used it in many approaches to mine useful data. However, the contents of a Stack Overflow discussion are inherently heterogeneous, mixing natural language, source code, stack traces, and configuration files in XML or JSON format. We constructed a full island grammar capable of modeling the set of 700,000 Stack Overflow discussions about Java, building a heterogeneous abstract syntax tree (H-AST) of each post (question, answer, or comment) in a discussion. The resulting dataset models every discussion in that set, providing a full H-AST for each type of structured fragment (i.e., JSON, XML, Java, and stack traces) and complementing this information with basic meta-information, such as term frequency, to enable natural language analyses. Our dataset allows the end user to perform combined analyses of Stack Overflow discussions by visiting the H-AST of a discussion.
Keywords :
Java; Web sites; XML; question answering (information retrieval); software engineering; JSON format; StORMeD; configuration files; heterogeneous abstract syntax tree; natural language; question and answer Website; software engineering researchers; source code; stack overflow ready made data; stack traces; term frequency; Data mining; Data models; Grammar; Natural languages; Software; h-ast; island parsing; unstructured data
Conference_Title :
Mining Software Repositories (MSR), 2015 IEEE/ACM 12th Working Conference on
Conference_Location :
Florence
DOI :
10.1109/MSR.2015.67