Title :
Fault Injection Experiments with the CLAMR Hydrodynamics Mini-App
Author :
Atkinson, Brian ; DeBardeleben, Nathan ; Guan, Qiang ; Robey, Robert ; Jones, William M.
Author_Institution :
Los Alamos Nat. Lab., Ultrascale Syst. Res. Center, Los Alamos, NM, USA
Abstract :
In this paper, we present a resilience analysis of the impact of soft errors on CLAMR, a hydrodynamics mini-app for high-performance computing (HPC). We use F-SEFI, a fine-grained fault injection tool, to inject faults into the kernel routines of CLAMR. We demonstrate visually the impact of these faults, which are either benign (have no impact on the results), cause silent data corruption (SDC), or cause the application to crash due to instabilities. Using F-SEFI, we quantify the probability that an injected fault will cause CLAMR to transition to each of these three states. Finally, we explore the relationship between the application's fault characteristics and the point in simulation time at which the fault is injected. Overall, we find that 17% and 24% of the faults propagate into SDC and crashes, respectively.
Keywords :
fault tolerant computing; hydrodynamics; parallel processing; program diagnostics; CLAMR hydrodynamics mini-app; F-SEFI tool; HPC; SDC; cell-based adaptive mesh refinement; fault injection experiments; fine grained fault injection tool; high performance computing; resilience analysis; silent data corruption; Circuit faults; Computer crashes; Fault tolerance; Fault tolerant systems; Kernel; Laboratories; Resilience; fault injection; fault-tolerance; hydrodynamics; mini-app; resilience;
Conference_Titel :
Software Reliability Engineering Workshops (ISSREW), 2014 IEEE International Symposium on
Conference_Location :
Naples
DOI :
10.1109/ISSREW.2014.51