• DocumentCode
    505969
  • Title
    Exploring event correlation for failure prediction in coalitions of clusters
  • Author
    Fu, Song ; Xu, Cheng-Zhong
  • Author_Institution
    Wayne State University, Detroit, MI
  • fYear
    2007
  • fDate
    10-16 Nov. 2007
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    In large-scale networked computing systems, component failures are the norm rather than the exception. Failure prediction is a crucial technique for self-managing resource burdens. Failure events in coalition systems exhibit strong correlations in both the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify temporal correlation and a stochastic model to describe spatial correlation. We further use application allocation information to discover additional correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. We implemented a failure prediction framework, called PREdictor of Failure Events Correlated Temporal-Spatially (hPREFECTs), which explores correlations among failures and forecasts the time-between-failure of future instances. We evaluate the performance of hPREFECTs in both offline prediction of failures, using the Los Alamos HPC traces, and online prediction in an institute-wide coalition of clusters. Experimental results show that the system achieves more than 76% accuracy in offline prediction and more than 70% accuracy in online prediction over the period from May 2006 to April 2007.
  • Keywords
    Accuracy; Availability; Checkpointing; Computer networks; Information systems; Large-scale systems; Permission; Software measurement; Stochastic processes; Technology management; coalition clusters; failure prediction; spatial correlation; system availability; temporal correlation
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Supercomputing, 2007. SC '07. Proceedings of the 2007 ACM/IEEE Conference on
  • Conference_Location
    Reno, NV, USA
  • Print_ISBN
    978-1-59593-764-3
  • Electronic_ISBN
    978-1-59593-764-3
  • Type
    conf
  • DOI
    10.1145/1362622.1362678
  • Filename
    5348800