DocumentCode :
2268203
Title :
The Effect of Correlated Failure on the Reliability of HPC Systems
Author :
Thanakornworakij, Thanadech ; Nassar, Raja ; Leangsuksun, Chokchai Box ; Paun, Mihaela
Author_Institution :
Coll. of Eng. & Sci., Louisiana Tech Univ., Ruston, LA, USA
fYear :
2011
fDate :
26-28 May 2011
Firstpage :
284
Lastpage :
288
Abstract :
High Performance Computing (HPC) system utilization can be maximized and sustained if one understands the system's failure behavior. The Time to Failure (TTF) of HPC systems has long been studied, and the Weibull distribution has been shown to give the best fit. In addition, in many cases the TTF of such systems exhibits correlations. In our previous study, we developed a reliability model of an HPC system in which failures among nodes are independent. However, some studies have clearly shown that in some cases nodes do not fail independently of one another. It is therefore important to develop a reliability model for an HPC system based on the occurrence of simultaneous failures. In this paper, we develop such a model and derive expressions for the probability density function of time to failure, system reliability, system failure rate, and mean time to failure (MTTF). Results show that if the failures of the components (nodes) in the system possess a degree of dependency, the system reliability decreases.
Keywords :
Weibull distribution; software reliability; HPC systems; MTTF; high performance computing; mean time to failure; probability density function; reliability model; Computational modeling; Computers; Correlation; Mathematical model; Reliability; Failure rate; System Reliability;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Processing with Applications Workshops (ISPAW), 2011 Ninth IEEE International Symposium on
Conference_Location :
Busan
Print_ISBN :
978-1-4577-0524-3
Electronic_ISBN :
978-0-7695-4429-8
Type :
conf
DOI :
10.1109/ISPAW.2011.55
Filename :
5951989