DocumentCode
1925904
Title
HPC failure prediction proficiency metrics
Author
Taerat, N. ; Leangsuksun, C.
Author_Institution
Louisiana Tech Univ., Ruston, LA, USA
fYear
2009
fDate
Aug. 31 2009-Sept. 4 2009
Firstpage
1
Lastpage
4
Abstract
Transient failures in large-scale HPC systems are increasing significantly because of the large number of components. Fault tolerance mechanisms exist, but each invocation imposes additional overhead on the application. Failure prediction is therefore needed to mitigate such events gracefully and to minimize the use of these mechanisms. However, the proficiency metrics used for HPC failure prediction are borrowed from related fields, mainly statistics, data mining, and information theory. Some of them fit well from certain perspectives, but none considers the computing time lost to prediction errors. We therefore present a study of the inadequacy of existing metrics and introduce additional metrics that account for potential lost computing time, to be used together with existing metrics in assessing HPC failure prediction proficiency.
Keywords
fault tolerant computing; failure prediction proficiency metric; fault tolerance mechanism; high-performance computing system; Costs; Data mining; Fault tolerance; Fault tolerant systems; Iron; Large-scale systems; Mean square error methods; Predictive models; Runtime; Statistics;
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on
Conference_Location
New Orleans, LA
ISSN
1552-5244
Print_ISBN
978-1-4244-5011-4
Type
conf
DOI
10.1109/CLUSTR.2009.5289156
Filename
5289156