DocumentCode :
228721
Title :
Understanding Soft Error Resiliency of Blue Gene/Q Compute Chip through Hardware Proton Irradiation and Software Fault Injection
Author :
Cher, Chen-Yong ; Gupta, Meeta S. ; Bose, Pradip ; Muller, K. Paul
Author_Institution :
IBM T. J. Watson Res. Center, Yorktown Heights, NY, USA
fYear :
2014
fDate :
16-21 Nov. 2014
Firstpage :
587
Lastpage :
596
Abstract :
Soft error resiliency is a major concern for petascale high-performance computing (HPC) systems. Blue Gene/Q (BG/Q) is the third generation of IBM's massively parallel, energy-efficient Blue Gene series of supercomputers. The principal goal of this work is to understand the interaction between Blue Gene/Q's hardware resiliency features and high-performance applications through proton irradiation of a real chip, and the software resiliency inherent in these applications through application-level fault injection (AFI) experiments. From the proton irradiation experiments, we derived the mean time between correctable errors at sea level of the SRAM-based register files and Level-1 caches for a system similar in scale to the Sequoia system. From the AFI experiments, we characterized the relative vulnerability among the applications in both the general-purpose and floating-point register files. We categorized and quantified the failure outcomes, and discovered characteristics in the applications that lead to many opportunities for improving masking.
Keywords :
SRAM chips; cache storage; floating point arithmetic; mainframes; microprocessor chips; parallel machines; radiation hardening (electronics); software fault tolerance; AFI experiments; BlueGene/Q compute chip; BlueGene/Q hardware resiliency features; HPC systems; Level-1 caches; SRAM-based register files; Sequoia system; application-level fault injection experiments; correctable errors; floating point register files; hardware proton irradiation; petascale high performance computing systems; soft error resiliency; software fault injection; software resiliency; third generation IBM massively parallel energy efficient Blue Gene series supercomputers; Circuit faults; Hardware; Particle beams; Protons; Radiation effects; Registers; Software; chip irradiation; co-design; fault injection; high-performance applications; soft error rate;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for
Conference_Location :
New Orleans, LA
Print_ISBN :
978-1-4799-5499-5
Type :
conf
DOI :
10.1109/SC.2014.53
Filename :
7013035