DocumentCode
720544
Title
Predicting and Mitigating Jobs Failures in Big Data Clusters
Author
Rosa, Andrea ; Chen, Lydia Y. ; Binder, Walter
Author_Institution
Fac. of Inf., Univ. della Svizzera italiana, Lugano, Switzerland
fYear
2015
fDate
4-7 May 2015
Firstpage
221
Lastpage
230
Abstract
In large-scale data centers, software and hardware failures are frequent, resulting in failures of job executions that may cause significant resource waste and performance deterioration. To proactively minimize the resource inefficiency due to job failures, it is important to identify them in advance using key job attributes. However, so far, prevailing research on datacenter workload characterization has overlooked job failures, including their patterns, root causes, and impact. In this paper, we aim to develop prediction models and mitigation policies for unsuccessful jobs, so as to reduce the resource waste in big data centers. In particular, we base our analysis on Google cluster traces, consisting of a large number of big-data jobs with a high task fan-out. We first identify the time-varying patterns of failed jobs and the contributing system features. Based on our characterization study, we develop an on-line predictive model for job failures by applying various statistical learning techniques, namely Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Logistic Regression (LR). Furthermore, we propose a delay-based mitigation policy which, after a certain grace period, proactively terminates the execution of jobs that are predicted to fail. The particular objective of postponing job terminations is to strike a good tradeoff between resource waste and false prediction of successful jobs. Our evaluation results show that the proposed method is able to significantly reduce the resource waste by 41.9% on average, and keep false terminations of jobs low, i.e., only 1%.
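The abstract names the classifiers (LDA, QDA, LR) and a grace-period termination policy but gives no implementation details. The sketch below illustrates the general approach with scikit-learn; the feature set, labels, and the 300-second grace period are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch of failure prediction plus a delay-based mitigation policy.
# The job attributes, labels, and grace period here are invented for
# illustration; the paper derives its features from Google cluster traces.
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic job attributes: [requested CPU, requested RAM, task fan-out].
X = rng.random((200, 3))
# Illustrative label: jobs with very high fan-out tend to fail (class 1).
y = (X[:, 2] > 0.6).astype(int)

# The three statistical learning techniques evaluated in the paper.
models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "LR": LogisticRegression(max_iter=1000),
}
for model in models.values():
    model.fit(X, y)

def should_terminate(model, job_features, elapsed_s, grace_period_s=300):
    """Delay-based mitigation: take no action during the grace period,
    then terminate jobs the classifier predicts will fail."""
    if elapsed_s < grace_period_s:
        return False
    return bool(model.predict([job_features])[0])
```

Postponing termination past the grace period trades some wasted resources for fewer false terminations of jobs that would have succeeded, which is the tradeoff the abstract highlights.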
Keywords
Big Data; computer centres; learning (artificial intelligence); pattern clustering; regression analysis; scheduling; software fault tolerance; Big Data clusters; Big-Data jobs; Google cluster traces; LDA; QDA; datacenter workload characterization; delay-based mitigation policy; failed jobs time-varying patterns; hardware failures; job execution failures; job terminations; jobs failures mitigation; jobs failures prediction; key job attributes; large-scale data centers; linear discriminant analysis; logistic regression; mitigation policies; online predictive model; performance deterioration; prediction models; quadratic discriminant analysis; resource inefficiency; resource waste; software failures; statistical learning techniques; Google; Measurement; Predictive models; Random access memory; Throughput; Time-varying systems; Training
fLanguage
English
Publisher
ieee
Conference_Titel
2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
Conference_Location
Shenzhen
Type
conf
DOI
10.1109/CCGrid.2015.139
Filename
7152488
Link To Document