Title :
HPC failure prediction proficiency metrics
Author :
Taerat, N. ; Leangsuksun, C.
Author_Institution :
Louisiana Tech Univ., Ruston, LA, USA
Date :
Aug. 31 2009-Sept. 4 2009
Abstract :
Transient failures in large-scale HPC systems are increasing significantly because of the large number of components. Fault tolerance mechanisms exist, but each invocation imposes additional overhead on the application. Failure prediction is therefore needed to mitigate such events gracefully and to minimize the use of these mechanisms. However, the proficiency metrics used for HPC failure prediction are borrowed from related fields, mainly statistics, data mining, and information theory. Some of them fit well from certain perspectives, but none considers the computing time lost to prediction error. We therefore present a study of the shortcomings of existing metrics and introduce additional metrics that address potential lost computing time, to be used together with existing metrics in judging HPC failure prediction proficiency.
Keywords :
fault tolerant computing; failure prediction proficiency metric; fault tolerance mechanism; high-performance computing system; Costs; Data mining; Fault tolerance; Fault tolerant systems; Iron; Large-scale systems; Mean square error methods; Predictive models; Runtime; Statistics;
Conference_Title :
Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on
Conference_Location :
New Orleans, LA
Print_ISBN :
978-1-4244-5011-4
DOI :
10.1109/CLUSTR.2009.5289156