DocumentCode :
185619
Title :
Failure Analysis of Jobs in Compute Clouds: A Google Cluster Case Study
Author :
Xin Chen; Charng-Da Lu; Karthik Pattabiraman
Author_Institution :
Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada
fYear :
2014
fDate :
3-6 Nov. 2014
Firstpage :
167
Lastpage :
177
Abstract :
In this paper, we analyze a workload trace from the Google cloud cluster and characterize the observed failures. The goal of our work is to improve the understanding of failures in compute clouds. We present the statistical properties of job and task failures, and attempt to correlate them with key scheduling constraints, node operations, and attributes of users in the cloud. We also explore the potential for early failure prediction and anomaly detection for the jobs. Based on our results, we speculate that there are many opportunities to enhance the reliability of the applications running in the cloud, such as proactive maintenance of nodes or limiting job resubmissions. We further find that resource usage patterns of the jobs can be leveraged by failure prediction techniques. Finally, we find that the termination statuses of jobs and tasks can be clustered into six dominant categories based on the user profiles.
Keywords :
cloud computing; failure analysis; system recovery; Google cloud cluster; anomaly detection; compute clouds; failure prediction; job failures; scheduling constraints; task failures; user profiles; Availability; Containers; Correlation; Google; Log-normal distribution; Maintenance engineering; cloud reliability; distributions
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Software Reliability Engineering (ISSRE), 2014 IEEE 25th International Symposium on
Conference_Location :
Naples
ISSN :
1071-9458
Print_ISBN :
978-1-4799-6032-3
Type :
conf
DOI :
10.1109/ISSRE.2014.34
Filename :
6982624