• DocumentCode
    1925904
  • Title
    HPC failure prediction proficiency metrics
  • Author
    Taerat, N. ; Leangsuksun, C.
  • Author_Institution
    Louisiana Tech Univ., Ruston, LA, USA
  • fYear
    2009
  • fDate
    Aug. 31 2009-Sept. 4 2009
  • Firstpage
    1
  • Lastpage
    4
  • Abstract
    Transient failures in large-scale HPC systems are increasing significantly because of the large number of components. Fault tolerance mechanisms exist, but each invocation imposes additional overhead on the application. Failure prediction is therefore needed to mitigate such events gracefully and to minimize the use of these mechanisms. However, the proficiency metrics used for HPC failure prediction are borrowed from related fields, mainly statistics, data mining, and information theory. Some of them fit well from certain perspectives, but none considers the computing time lost to prediction errors. We therefore present a study of the shortcomings of existing metrics and introduce additional metrics that account for potential lost computing time, to be used together with existing metrics in assessing HPC failure prediction proficiency.
  • Keywords
    fault tolerant computing; failure prediction proficiency metric; fault tolerance mechanism; high-performance computing system; Costs; Data mining; Fault tolerance; Fault tolerant systems; Iron; Large-scale systems; Mean square error methods; Predictive models; Runtime; Statistics;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on
  • Conference_Location
    New Orleans, LA
  • ISSN
    1552-5244
  • Print_ISBN
    978-1-4244-5011-4
  • Type
    conf
  • DOI
    10.1109/CLUSTR.2009.5289156
  • Filename
    5289156