DocumentCode
3657093
Title
Understanding the Dark Side of Big Data Clusters: An Analysis beyond Failures
Author
Rosà;Lydia Y. Chen;Walter Binder
Author_Institution
Fac. of Inf., Univ. della Svizzera italiana, Lugano, Switzerland
fYear
2015
fDate
6/1/2015 12:00:00 AM
Firstpage
207
Lastpage
218
Abstract
Motivated by the high system complexity of today's datacenters, a large body of related studies tries to understand workloads and resource utilization in datacenters. However, there is little work on exploring unsuccessful job and task executions. In this paper, we study three types of unsuccessful executions in traces of a Google datacenter, namely fail, kill, and eviction. The objective of our analysis is to identify their resource waste, impacts on application performance, and root causes. We first quantitatively show their strong negative impact on CPU, RAM, and DISK usage and on task slowdown. We analyze patterns of unsuccessful jobs and tasks, particularly focusing on their interdependency. Moreover, we uncover their root causes by inspecting key workload and system attributes such as machine locality and concurrency level. Our results help in the design of low-latency and fault-tolerant big-data systems.
Keywords
"Random access memory","Time factors","Google","Predictive models","Measurement","Electronic mail","Resource management"
Publisher
ieee
Conference_Titel
Dependable Systems and Networks (DSN), 2015 45th Annual IEEE/IFIP International Conference on
Type
conf
DOI
10.1109/DSN.2015.37
Filename
7266851