Regional Information Center for Science and Technology (RICEST) - A Large-Scale Study of Failures in High-Performance Computing Systems

DocumentCode :

1147322

Title :

A Large-Scale Study of Failures in High-Performance Computing Systems

Author :

Schroeder, Bianca ; Gibson, Garth A.

Author_Institution :

Dept. of Comput. Sci., Univ. of Toronto, Toronto, ON, Canada

Volume :

Issue :

Year :

2010

Firstpage :

337

Lastpage :

350

Abstract :

Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations are publicly available. This paper analyzes failure data collected at two large high-performance computing sites. The first data set has been collected over the past nine years at Los Alamos National Laboratory (LANL) and has recently been made publicly available. It covers 23,000 failures recorded on more than 20 different systems at LANL, mostly large clusters of SMP and NUMA nodes. The second data set has been collected over the period of one year on one large supercomputing system comprising 20 nodes and more than 10,000 processors. We study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. We find, for example, that average failure rates differ wildly across systems, ranging from 20 to 1,000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate. From one system to another, mean repair time varies from less than an hour to more than a day, and repair times are well modeled by a lognormal distribution.
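
The two distributional findings in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names and every parameter value below are hypothetical and chosen only for illustration, since the abstract reports no fitted parameters. The sketch shows that a Weibull distribution with shape parameter below 1 has a hazard rate that falls over time, and that a lognormal distribution has a mean above its median, which is what a heavy right tail of long repairs would look like.

```python
import math


def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def lognormal_mean_and_median(mu, sigma):
    """Mean and median of a lognormal whose underlying normal has mean mu and std sigma."""
    mean = math.exp(mu + sigma ** 2 / 2)
    median = math.exp(mu)
    return mean, median


# Hypothetical parameters (assumptions, NOT values reported in the paper).
SHAPE = 0.7    # shape < 1 gives a decreasing hazard rate
SCALE = 100.0  # time scale, e.g. in hours

# Evaluate the hazard at increasing times since the last failure.
times = (1.0, 10.0, 100.0, 1000.0)
hazards = [weibull_hazard(t, SHAPE, SCALE) for t in times]
is_decreasing = all(a > b for a, b in zip(hazards, hazards[1:]))

# Lognormal repair times: the mean exceeds the median because of the long tail.
mean_repair, median_repair = lognormal_mean_and_median(mu=1.0, sigma=1.5)

if __name__ == "__main__":
    for t, h in zip(times, hazards):
        print(f"t = {t:7.1f}  hazard = {h:.6f}")
    print("hazard decreasing:", is_decreasing)
    print(f"repair mean = {mean_repair:.2f}, median = {median_repair:.2f}")
```

With a shape parameter of exactly 1 the Weibull reduces to the exponential distribution and the hazard is constant, so a decreasing hazard means a failure becomes less likely the longer a system has already run without one.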

Keywords :

Weibull distribution; fault tolerant computing; parallel machines; Los Alamos National Laboratory; high-performance computing systems; large supercomputing system; large-scale study; Availability; Data analysis; Failure analysis; Hazards; Laboratories; Large-scale systems; Resource management; Statistical distributions; Testing; empirical study; failures; field study; high-performance computing; node outages; reliability; repair time; root cause; supercomputing; time between failures

Language :

English

Journal_Title :

Dependable and Secure Computing, IEEE Transactions on

Publisher :

IEEE

ISSN :

1545-5971

Type :

Journal article

DOI :

10.1109/TDSC.2009.4

Filename :

4775906

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1147322