DocumentCode :
244380
Title :
Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters
Author :
Di Martino, Catello ; Kalbarczyk, Zbigniew ; Iyer, Ravishankar K. ; Baccanico, Fabio ; Fullop, Joshi ; Kramer, William
Author_Institution :
Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
fYear :
2014
fDate :
23-26 June 2014
Firstpage :
610
Lastpage :
621
Abstract :
This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis is based on both manual failure reports and automatically generated event logs collected over 261 days. Results include i) a characterization of the root causes of single-node failures, ii) a direct assessment of the effectiveness of system-level failover as well as memory, processor, network, GPU accelerator, and file system error resiliency, and iii) an analysis of system-wide outages. The major findings of this study are as follows. Hardware is not the main cause of system downtime. This is notwithstanding the fact that hardware-related failures are 42% of all failures. Failures caused by hardware were responsible for only 23% of the total repair time. These results are partially due to the fact that processor and memory protection mechanisms (x8 and x4 Chipkill, ECC, and parity) are able to handle a sustained rate of errors as high as 250 errors/h while providing a coverage of 99.997% out of a set of more than 1.5 million analyzed errors. Only 28 multiple-bit errors bypassed the employed protection mechanisms. Software, on the other hand, was the largest contributor to the node repair hours (53%), despite being the cause of only 20% of the total number of failures. A total of 29 out of 39 system-wide outages involved the Lustre file system, with 42% of them caused by the inadequacy of the automated failover procedures.
Keywords :
Cray computers; failure analysis; mainframes; parallel machines; system recovery; Blue Waters; CPU-GPU supercomputer; Cray hybrid supercomputer; ECC; GPU accelerator; Lustre file system; University of Illinois; Urbana-Champaign; automatically generated event logs; file system error resiliency; hardware-related failures; manual failure reports; memory protection mechanisms; parity; single-node failures; system failure analysis; system-wide outage analysis; x4 Chipkill; x8 Chipkill; Blades; Error correction codes; Graphics processing units; Hardware; Maintenance engineering; Random access memory; Cray XE6; Cray XK7; Failure Analysis; Failure Reports; Machine Check; Nvidia GPU errors; Supercomputer
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on
Conference_Location :
Atlanta, GA
Type :
conf
DOI :
10.1109/DSN.2014.62
Filename :
6903615