Regional Information Center for Science and Technology - Failure Analysis of Jobs in Compute Clouds: A Google Cluster Case Study

DocumentCode :

185619

Title :

Failure Analysis of Jobs in Compute Clouds: A Google Cluster Case Study

Author :

Xin Chen; Charng-Da Lu; Karthik Pattabiraman

Author_Institution :

Dept. of Electr. & Comput. Eng., Univ. of British Columbia, Vancouver, BC, Canada

fYear :

2014

fDate :

3-6 Nov. 2014

Firstpage :

167

Lastpage :

177

Abstract :

In this paper, we analyze a workload trace from the Google cloud cluster and characterize the observed failures. The goal of our work is to improve the understanding of failures in compute clouds. We present the statistical properties of job and task failures, and attempt to correlate them with key scheduling constraints, node operations, and attributes of users in the cloud. We also explore the potential for early failure prediction and anomaly detection for the jobs. Based on our results, we speculate that there are many opportunities to enhance the reliability of the applications running in the cloud, such as proactive maintenance of nodes or limiting job resubmissions. We further find that resource usage patterns of the jobs can be leveraged by failure prediction techniques. Finally, we find that the termination statuses of jobs and tasks can be clustered into six dominant categories based on the user profiles.

Keywords :

cloud computing; failure analysis; system recovery; Google cloud cluster; anomaly detection; compute clouds; failure prediction; job failures; scheduling constraints; task failures; user profiles; Availability; Containers; Correlation; Google; Log-normal distribution; Maintenance engineering; Job failure; cloud reliability; distributions

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Software Reliability Engineering (ISSRE), 2014 IEEE 25th International Symposium on

Conference_Location :

Naples

ISSN :

1071-9458

Print_ISBN :

978-1-4799-6032-3

Type :

conf

DOI :

10.1109/ISSRE.2014.34

Filename :

6982624

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=185619