Title :
Correctness Field Testing of Production and Decommissioned High Performance Computing Platforms at Los Alamos National Laboratory
Author :
Michalak, Sarah E. ; Rust, William N. ; Daly, John T. ; DuBois, Andrew J. ; DuBois, David H.
Author_Institution :
Stat. Sci. Group, Los Alamos Nat. Lab., Los Alamos, NM, USA
Abstract :
Silent Data Corruption (SDC) can threaten the integrity of scientific calculations performed on high performance computing (HPC) platforms and other systems. To characterize this issue, correctness field testing of HPC platforms at Los Alamos National Laboratory was performed. This work presents results for 12 platforms, including over 1,000 node-years of computation performed on over 8,750 compute nodes and over 260 petabytes of data transfers involving nearly 6,000 compute nodes, and relevant lessons learned. Incorrect results characteristic of transient errors and of intermittent errors were observed. These results are a key underpinning to resilience efforts as they provide signatures of incorrect results observed under field conditions. Five incorrect results consistent with a transient error mechanism were observed, suggesting that the effects of transient errors could be mitigated. However, the observed numbers of incorrect results consistent with an intermittent error mechanism suggest that intermittent errors could substantially affect computational correctness.
Keywords :
natural sciences computing; parallel processing; HPC platforms; Los Alamos National Laboratory; SDC; correctness field testing; decommissioned high performance computing platform; intermittent error mechanism; production high performance computing platform; scientific calculations; silent data corruption; transient error mechanism; Computer architecture; Data transfer; High performance computing; Production; SDRAM; Testing; Transient analysis; Cluster computing; HPC cluster; Linpack; field testing; interconnect testing; intermittent error; resilience; soft error; transient error
Conference_Title :
SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
Conference_Location :
New Orleans, LA
Print_ISBN :
978-1-4799-5499-5