Regional Information Center for Science and Technology - HPC failure prediction proficiency metrics

DocumentCode :

1925904

Title :

HPC failure prediction proficiency metrics

Author :

Taerat, N. ; Leangsuksun, C.

Author_Institution :

Louisiana Tech Univ., Ruston, LA, USA

fYear :

2009

fDate :

Aug. 31 2009-Sept. 4 2009

Firstpage :

Lastpage :

Abstract :

Transient failures in large-scale HPC systems are increasing significantly because of the large number of components. Fault tolerance mechanisms exist, but each invocation imposes additional overhead on the application. Failure prediction is therefore needed to mitigate such events gracefully and to minimize the use of these mechanisms. However, the proficiency metrics for HPC failure prediction are borrowed from related fields, mainly statistics, data mining, and information theory. Some of them fit well from certain perspectives, but none considers the computing time lost to prediction error. We therefore present a study of the shortcomings of existing metrics and introduce additional metrics that account for potential lost computing time, to be used together with existing metrics in assessing HPC failure prediction proficiency.

Keywords :

fault tolerant computing; failure prediction proficiency metric; fault tolerance mechanism; high-performance computing system; Costs; Data mining; Fault tolerance; Fault tolerant systems; Iron; Large-scale systems; Mean square error methods; Predictive models; Runtime; Statistics;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on

Conference_Location :

New Orleans, LA

ISSN :

1552-5244

Print_ISBN :

978-1-4244-5011-4

Type :

conf

DOI :

10.1109/CLUSTR.2009.5289156

Filename :

5289156

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1925904