DocumentCode :
505969
Title :
Exploring event correlation for failure prediction in coalitions of clusters
Author :
Fu, Song ; Xu, Cheng-Zhong
Author_Institution :
Wayne State University, Detroit, MI
fYear :
2007
fDate :
10-16 Nov. 2007
Firstpage :
1
Lastpage :
12
Abstract :
In large-scale networked computing systems, component failures become norms instead of exceptions. Failure prediction is a crucial technique for self-managing resource burdens. Failure events in coalition systems exhibit strong correlations in both the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to describe spatial correlation. We further utilize the information of application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. We implemented a failure prediction framework, called PREdictor of Failure Events Correlated Temporal-Spatially (hPREFECTs), which explores correlations among failures and forecasts the time-between-failure of future instances. We evaluate the performance of hPREFECTs in both offline prediction of failures using the Los Alamos HPC traces and online prediction in an institute-wide coalition of clusters. Experimental results show the system achieves more than 76% accuracy in offline prediction and more than 70% accuracy in online prediction during the period from May 2006 to April 2007.
Keywords :
Accuracy; Availability; Checkpointing; Computer networks; Information systems; Large-scale systems; Permission; Software measurement; Stochastic processes; Technology management; coalition clusters; failure prediction; spatial correlation; system availability; temporal correlation;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Supercomputing, 2007. SC '07. Proceedings of the 2007 ACM/IEEE Conference on
Conference_Location :
Reno, NV, USA
Print_ISBN :
978-1-59593-764-3
Electronic_ISBN :
978-1-59593-764-3
Type :
conf
DOI :
10.1145/1362622.1362678
Filename :
5348800