• DocumentCode
    228726
  • Title

    Correctness Field Testing of Production and Decommissioned High Performance Computing Platforms at Los Alamos National Laboratory

  • Author

    Michalak, Sarah E. ; Rust, William N. ; Daly, John T. ; Dubois, Andrew J. ; Dubois, David H.

  • Author_Institution
    Stat. Sci. Group, Los Alamos Nat. Lab., Los Alamos, NM, USA
  • fYear
    2014
  • fDate
    16-21 Nov. 2014
  • Firstpage
    609
  • Lastpage
    619
  • Abstract
    Silent Data Corruption (SDC) can threaten the integrity of scientific calculations performed on high performance computing (HPC) platforms and other systems. To characterize this issue, correctness field testing of HPC platforms at Los Alamos National Laboratory was performed. This work presents results for 12 platforms, including over 1,000 node-years of computation performed on over 8,750 compute nodes and over 260 petabytes of data transfers involving nearly 6,000 compute nodes, and relevant lessons learned. Incorrect results characteristic of transient errors and of intermittent errors were observed. These results are a key underpinning to resilience efforts as they provide signatures of incorrect results observed under field conditions. Five incorrect results consistent with a transient error mechanism were observed, suggesting that the effects of transient errors could be mitigated. However, the observed numbers of incorrect results consistent with an intermittent error mechanism suggest that intermittent errors could substantially affect computational correctness.
  • Keywords
    natural sciences computing; parallel processing; HPC platforms; Los Alamos National Laboratory; SDC; correctness field testing; decommissioned high performance computing platform; intermittent error mechanism; production high performance computing platform; scientific calculations; silent data corruption; transient error mechanism; Computer architecture; Data transfer; High performance computing; Production; SDRAM; Testing; Transient analysis; Cluster computing; HPC cluster; Linpack; field testing; high performance computing; interconnect testing; intermittent error; resilience; silent data corruption; soft error; transient error;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for
  • Conference_Location
    New Orleans, LA
  • Print_ISBN
    978-1-4799-5499-5
  • Type

    conf

  • DOI
    10.1109/SC.2014.55
  • Filename
    7013037