On the Cost of Mining Very Large Open Source Repositories

Author

Banerjee, Sean; Cukic, Bojan

Author_Institution

Robot. Inst., Carnegie Mellon Univ., Pittsburgh, PA, USA

fYear

2015

fDate

23 May 2015

Firstpage

37

Lastpage

43

Abstract

Open source bug tracking systems provide a rich body of information that software engineering researchers actively use to design solutions to triaging, duplicate classification and developer assignment problems. Today, open repositories often contain in excess of 100,000 reports and, in the cases of RedHat and Mozilla, over a million. Obtaining and analyzing the contents of such datasets is both time and resource consuming. By summarizing the related work, we demonstrate that researchers have often focused on smaller subsets of the data and seldom embrace “big-dataism”. With the emergence of cloud-based computation systems such as Amazon EC2, one expects it to be easier to perform large-scale analyses. However, our detailed time and cost analysis indicates that significant challenges still remain. Acquiring the open source data can be time intensive, and the acquisition is prone to being misinterpreted as a Denial of Service attack. Generating similarity scores for all prior reports, for example, is a polynomial time problem. In this paper, we present the actual costs that we incurred when analyzing the complete repositories from Eclipse, Firefox and Open Office. In our approach, we relied on computing clusters to process the data in an attempt to reduce the cost of analyzing large datasets on the cloud. We present estimated costs for a researcher attempting to analyze complete datasets from Eclipse, Mozilla, Novell and RedHat using the best possible resources. In an ideal situation, with no bottlenecks, a researcher investing just over $40,000 and 2 weeks of nonstop computing time would be able to measure the similarity of problem reports within all four datasets.
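
The remark that scoring every report against all prior reports is a polynomial time problem can be illustrated with a rough count. The sketch below is not taken from the paper: it assumes each report is compared exactly once with every earlier report, which gives n(n-1)/2 comparisons, and the function name `pairwise_comparisons` is hypothetical. The abstract does not state the actual similarity measure or its per-comparison cost.

```python
def pairwise_comparisons(n_reports: int) -> int:
    """Count similarity scores needed when each report is compared
    once with every report filed before it (assumed scheme)."""
    if n_reports < 0:
        raise ValueError("n_reports must be non-negative")
    # Report k has k-1 predecessors, so the total is 0 + 1 + ... + (n-1).
    return n_reports * (n_reports - 1) // 2


if __name__ == "__main__":
    # Repository sizes of the order mentioned in the abstract.
    for n in (1_000, 100_000, 1_000_000):
        print(f"{n:>9,} reports -> {pairwise_comparisons(n):,} comparisons")
```

Under this assumption, a tenfold increase in repository size raises the number of comparisons roughly a hundredfold, which is consistent with the abstract's point that complete repositories are far costlier to analyze than small subsets.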

Keywords

Big Data; cloud computing; computational complexity; data mining; public domain software; software engineering; Amazon EC2; Big-Dataism; Eclipse; Firefox; Novell; Open Office; RedHat; cloud based computation systems; cost analysis; data processing; denial of service attacks; open source bug tracking systems; polynomial time problem; time analysis; very large open source repository mining; Accuracy; Computer crime; Graphics processing units; Random access memory; XML

fLanguage

English

Publisher

IEEE

Conference_Titel

Big Data Software Engineering (BIGDSE), 2015 IEEE/ACM 1st International Workshop on

Conference_Location

Florence

Type

conf

DOI

10.1109/BIGDSE.2015.16

Filename

7166057