DocumentCode
228726
Title
Correctness Field Testing of Production and Decommissioned High Performance Computing Platforms at Los Alamos National Laboratory
Author
Michalak, Sarah E. ; Rust, William N. ; Daly, John T. ; DuBois, Andrew J. ; DuBois, David H.
Author_Institution
Stat. Sci. Group, Los Alamos Nat. Lab., Los Alamos, NM, USA
fYear
2014
fDate
16-21 Nov. 2014
Firstpage
609
Lastpage
619
Abstract
Silent Data Corruption (SDC) can threaten the integrity of scientific calculations performed on high performance computing (HPC) platforms and other systems. To characterize this issue, correctness field testing of HPC platforms at Los Alamos National Laboratory was performed. This work presents results for 12 platforms, including over 1,000 node-years of computation performed on over 8,750 compute nodes and over 260 petabytes of data transfers involving nearly 6,000 compute nodes, and relevant lessons learned. Incorrect results characteristic of transient errors and of intermittent errors were observed. These results are a key underpinning to resilience efforts as they provide signatures of incorrect results observed under field conditions. Five incorrect results consistent with a transient error mechanism were observed, suggesting that the effects of transient errors could be mitigated. However, the observed numbers of incorrect results consistent with an intermittent error mechanism suggest that intermittent errors could substantially affect computational correctness.
Keywords
natural sciences computing; parallel processing; HPC platforms; Los Alamos National Laboratory; SDC; correctness field testing; decommissioned high performance computing platform; intermittent error mechanism; production high performance computing platform; scientific calculations; silent data corruption; transient error mechanism; Computer architecture; Data transfer; High performance computing; Production; SDRAM; Testing; Transient analysis; Cluster computing; HPC cluster; Linpack; field testing; high performance computing; interconnect testing; intermittent error; resilience; silent data corruption; soft error; transient error;
fLanguage
English
Publisher
IEEE
Conference_Titel
High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for
Conference_Location
New Orleans, LA
Print_ISBN
978-1-4799-5499-5
Type
conf
DOI
10.1109/SC.2014.55
Filename
7013037