• DocumentCode
    1244042
  • Title

    Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer

  • Author

    Michalak, Sarah E. ; Harris, Kevin W. ; Hengartner, Nicolas W. ; Takala, Bruce E. ; Wender, Stephen A.

  • Author_Institution
    Stat. Sci. Group, Los Alamos Nat. Lab., NM, USA
  • Volume
    5
  • Issue
    3
  • fYear
    2005
  • Firstpage
    329
  • Lastpage
    335
  • Abstract
    Early in the deployment of the Advanced Simulation and Computing (ASC) Q supercomputer, a higher-than-expected number of single-node failures was observed. The elevated rate of single-node failures was hypothesized to be caused primarily by fatal soft errors, i.e., board-level cache (B-cache) tag (BTAG) parity errors caused by cosmic-ray-induced neutrons that led to node crashes. A series of experiments was undertaken at the Los Alamos Neutron Science Center (LANSCE) to ascertain whether fatal soft errors were indeed the primary cause of the elevated rate of single-node failures. Observed failure data from Q are consistent with the results from some of these experiments. Mitigation strategies have been developed, and scientists successfully use Q for large computations in the presence of fatal soft errors and other single-node failures.
  • Keywords
    SRAM chips; cosmic ray neutrons; error correction codes; error detection codes; failure analysis; fault tolerant computing; integrated circuit testing; mainframes; neutron effects; parallel machines; parity check codes; semiconductor device testing; ASC Q supercomputer; Los Alamos National Laboratory; board level cache tag parity errors; cosmic ray induced neutrons; fatal soft errors; linear accelerators; memory testing; neutron beam; neutron radiation effects; node crashes; semiconductor device radiation effects; single event upset; single node failures; soft error rate; Computational modeling; Computer errors; Laboratories; Life testing; Neutrons; Random access memory; Runtime; Supercomputers; Cosmic-ray-induced neutron; life estimation; neutron-induced soft error; semiconductor-device radiation effects; semiconductor-device testing; single-event upset; soft-error rate; static random access memory (SRAM) chips
  • fLanguage
    English
  • Journal_Title
    Device and Materials Reliability, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1530-4388
  • Type

    jour

  • DOI
    10.1109/TDMR.2005.855685
  • Filename
    1545893