StORMeD: Stack Overflow Ready Made Data

Author

Ponzanelli, Luca; Mocci, Andrea; Lanza, Michele

Author_Institution

REVEAL @ Fac. of Inf., Univ. of Lugano, Lugano, Switzerland

fYear

2015

fDate

16-17 May 2015

Firstpage

474

Lastpage

477

Abstract

Stack Overflow is the de facto Question and Answer (Q&A) website for developers, and software engineering researchers have used it in many approaches to mine useful data. However, the contents of a Stack Overflow discussion are inherently heterogeneous, mixing natural language, source code, stack traces, and configuration files in XML or JSON format. We constructed a full island grammar capable of modeling the set of 700,000 Stack Overflow discussions about Java, building a heterogeneous abstract syntax tree (H-AST) of each post (question, answer, or comment) in a discussion. The resulting dataset models every Stack Overflow discussion, providing a full H-AST for each type of structured fragment (i.e., JSON, XML, Java, stack traces) and complementing this information with a set of basic meta-information, such as term frequency, to enable natural language analyses. Our dataset allows the end user to perform combined analyses of Stack Overflow by visiting the H-AST of a discussion.
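
The idea of visiting an H-AST to combine structural and natural language analyses can be illustrated with a minimal sketch. It is not the StORMeD data model or API: the `Node` class, the fragment kinds, the `visit` and `summarize` helpers, and the sample post are all hypothetical, chosen only to show a single traversal that counts fragment types and computes term frequency over the prose fragments.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One fragment of a post (hypothetical model, not the StORMeD schema)."""
    kind: str                      # e.g. "text", "java", "json", "xml", "stacktrace"
    content: str = ""
    children: List["Node"] = field(default_factory=list)


def visit(node, fn):
    """Pre-order traversal: apply fn to the node, then to each descendant."""
    fn(node)
    for child in node.children:
        visit(child, fn)


def summarize(root):
    """Count fragments per kind and term frequencies over text fragments."""
    kinds, terms = Counter(), Counter()

    def collect(node):
        kinds[node.kind] += 1
        if node.kind == "text":
            # Naive whitespace tokenization, for illustration only.
            terms.update(node.content.lower().split())

    visit(root, collect)
    return kinds, terms


# Hypothetical post mixing prose, a Java snippet, and a stack trace line.
post = Node("post", children=[
    Node("text", "Why does this code throw"),
    Node("java", "String s = null; s.length();"),
    Node("stacktrace", "java.lang.NullPointerException at Main.main(Main.java:3)"),
    Node("text", "This code fails at runtime"),
])

kinds, terms = summarize(post)
```

Here `kinds` records how many fragments of each type the post contains, while `terms` holds term frequencies drawn only from the natural language fragments, so code and stack traces do not pollute the text statistics.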

Keywords

Java; Web sites; XML; question answering (information retrieval); software engineering; JSON format; StORMeD; configuration files; heterogeneous abstract syntax tree; natural language; question and answer Website; software engineering researchers; source code; stack overflow ready made data; stack traces; term frequency; Data mining; Data models; Grammar; Natural languages; Software; h-ast; island parsing; unstructured data

fLanguage

English

Publisher

IEEE

Conference_Title

Mining Software Repositories (MSR), 2015 IEEE/ACM 12th Working Conference on

Conference_Location

Florence

Type

conf

DOI

10.1109/MSR.2015.67

Filename

7180121