DocumentCode
3275728
Title
On the Cost of Mining Very Large Open Source Repositories
Author
Banerjee, Sean ; Cukic, Bojan
Author_Institution
Robot. Inst., Carnegie Mellon Univ., Pittsburgh, PA, USA
fYear
2015
fDate
23-23 May 2015
Firstpage
37
Lastpage
43
Abstract
Open source bug tracking systems provide a rich information suite that is actively used by software engineering researchers to design solutions to triaging, duplicate classification and developer assignment problems. Today, open repositories often contain in excess of 100,000 reports, and in the cases of RedHat and Mozilla, over a million. Obtaining and analyzing the contents of such datasets is both time- and resource-consuming. By summarizing the related work, we demonstrate that researchers often focus on smaller subsets of the data and seldom embrace "big-dataism". With the emergence of cloud-based computation systems such as Amazon EC2, one expects it to be easier to perform large-scale analyses. However, our detailed time and cost analysis indicates that significant challenges still remain. Acquiring the open source data can be time-intensive, and the acquisition is prone to being misinterpreted as a Denial of Service attack. Generating similarity scores for all prior reports, for example, is a polynomial-time problem. In this paper, we present actual costs that we incurred when analyzing the complete repositories from Eclipse, Firefox and Open Office. In our approach, we relied on computing clusters to process the data in an attempt to reduce the cost of analyzing large datasets on the cloud. We present estimated costs for a researcher attempting to analyze complete datasets from Eclipse, Mozilla, Novell and RedHat using the best possible resources. In an ideal situation, with no bottlenecks, a researcher investing just over $40,000 and 2 weeks of nonstop computing time would be able to measure the similarity of problem reports within all four datasets.
Keywords
Big Data; cloud computing; computational complexity; data mining; public domain software; software engineering; Amazon EC2; Big-Dataism; Eclipse; Firefox; Novell; Open Office; RedHat; cloud based computation systems; cost analysis; data processing; denial of service attacks; open source bug tracking systems; polynomial time problem; software engineering; time analysis; very large open source repository mining; Accuracy; Computer crime; Data mining; Graphics processing units; Random access memory; XML;
fLanguage
English
Publisher
IEEE
Conference_Titel
Big Data Software Engineering (BIGDSE), 2015 IEEE/ACM 1st International Workshop on
Conference_Location
Florence
Type
conf
DOI
10.1109/BIGDSE.2015.16
Filename
7166057
Link To Document