Title :
Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management
Author :
Fu, Song ; Xu, Cheng-Zhong
Author_Institution :
Wayne State Univ., Detroit
Abstract :
Networked computing systems continue to grow in scale and in the complexity of their components and interactions. In these environments, component failures become the norm rather than the exception. Moreover, failure events exhibit strong correlations in both the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify temporal correlation and a stochastic model to characterize spatial correlation. The models are further extended to take application allocation information into account in order to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. Experimental results on a production coalition system, the Wayne State Grid, show that the offline and online predictions of our prediction system can forecast 72.7% to 85.3% of the failure occurrences and capture failure correlations in a cluster coalition environment.
Keywords :
computational complexity; grid computing; object-oriented programming; stochastic processes; Wayne State Grid; application allocation; cluster coalition environment; cluster failure events; component complexity; component failures; failure correlations; failure instances; interaction complexity; networked computing systems; predicting system; proactive management; production coalition system; spatial correlation; stochastic model; temporal correlation; timescale parameter; Computer network management; Computer network reliability; Computer networks; Distributed computing; Engineering management; Failure analysis; Grid computing; Production systems; Stochastic processes; USA Councils;
Conference_Title :
Reliable Distributed Systems, 2007. SRDS 2007. 26th IEEE International Symposium on
Conference_Location :
Beijing
Print_ISBN :
0-7695-2995-X
DOI :
10.1109/SRDS.2007.18