• DocumentCode
    524656
  • Title
    Proactive Failure Management for High Availability Computing in Computer Clusters
  • Author
    Zhang, Ziming; Fu, Song

  • Author_Institution
    Dept. of Comput. Sci. & Eng., New Mexico Inst. of Min. & Technol., Socorro, NM, USA
  • Volume
    1
  • fYear
    2010
  • fDate
    28-31 May 2010
  • Firstpage
    377
  • Lastpage
    381
  • Abstract
    In this paper, we propose a framework for autonomic failure management with hierarchical failure prediction functionality for coalition clusters. It analyzes node, cluster, and system-wide failure behaviors and forecasts prospective failure occurrences based on quantified failure dynamics. The predictor also inspects failure correlations. Experimental results from a campus computational grid show that the offline and online predictions made by our predictors accurately forecast the failure trend and capture failure correlations in a coalition cluster environment.
  • Keywords
    Availability; Conference management; Data analysis; Engineering management; Failure analysis; Grid computing; Large-scale systems; Performance analysis; Resource management; Technology management
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Science and Optimization (CSO), 2010 Third International Joint Conference on
  • Conference_Location
    Huangshan, Anhui, China
  • Print_ISBN
    978-1-4244-6812-6
  • Electronic_ISBN
    978-1-4244-6813-3
  • Type
    conf
  • DOI
    10.1109/CSO.2010.71
  • Filename
    5533049