Title :
Job Failure Analysis and Its Implications in a Large-Scale Production Grid
Author :
Li, Hui ; Groep, David ; Wolters, Lex ; Templon, Jeff
Author_Institution :
Leiden University, The Netherlands
Abstract :
In this paper we present an initial analysis of job failures in a large-scale data-intensive Grid. Based on three representative periods in production, we characterize the interarrival times and life spans of failed jobs. Different failure types are distinguished and the analysis is carried out further at the Virtual Organization (VO) level. The spatial behavior, namely where job failures occur in the Grid, is also examined. Cross-correlation structures, including how arrivals correlate with life spans of job failures, are analyzed and illustrated. We further investigate statistical models to fit the failure data and propose several failure-aware scheduling strategies at the Grid level. Our results show that the overall failure rates in the Grid are significant, ranging from 25% to 33% of all submitted jobs. However, only 5% to 8% of the jobs failed after running on a Computing Element (CE). The rest of the failed jobs are aborted or cancelled without running. A majority of failed jobs come from several large production VOs, and a large number of these failures are centered around several main CEs. The interarrival time processes of failed jobs are shown to be bursty, and the life spans exhibit strong autocorrelations. Based on the failure patterns, we argue that it is important for the Grid resource brokers to track historical failures and take them into account in decision making. Some proactive measures and accountability issues are also discussed.
Keywords :
Application software; Autocorrelation; Collaborative work; Computer science; Decision making; Failure analysis; Grid computing; Job production systems; Large-scale systems; Processor scheduling;
Conference_Title :
e-Science and Grid Computing, 2006. e-Science '06. Second IEEE International Conference on
Conference_Location :
Amsterdam, The Netherlands
Print_ISBN :
0-7695-2734-5
DOI :
10.1109/E-SCIENCE.2006.261111