DocumentCode :
1244047
Title :
Asymmetries in soft-error rates in a large cluster system
Author :
Harris, Kevin W.
Author_Institution :
High Performance Comput. Div., Hewlett-Packard Co., Nashua, NH, USA
Volume :
5
Issue :
3
fYear :
2005
Firstpage :
336
Lastpage :
342
Abstract :
Early in the deployment of the ASC Q cluster supercomputer system, an unexpectedly high rate of soft errors was observed in the board-level cache subsystems of the AlphaServer ES45 systems that make up the compute component of this large cluster. A series of tests and experiments was undertaken to validate the hypothesis that this frequency was consistent with the high terrestrial secondary cosmic-ray neutron flux resulting from the high elevation of the installation site. The overall success of this effort is reported elsewhere in this issue. This paper reports on three secondary phenomena that were observed during these tests and experiments. Error logs were collected from all servers during a representative period and examined for nonrandom event rates, which would indicate a systematic cause. The only significant results of this exploration were a latent soft-error discovery effect and a self-shielding effect, whereby the servers positioned physically higher in their racks suffered disproportionately higher soft-error rates. This excess was examined and found to be consistent with the established shielding effect of the high-Z composition of the constituents of the overlying systems. Experiments with individual ES45 systems in an artificial neutron beam at the Los Alamos Neutron Science Center facility have established that the soft-error rate observed in the SRAM parts depends significantly on the incident direction of the neutrons in the beam. These asymmetries could be exploited as part of a strategy for mitigating the frequency of soft errors in future computer systems.
Keywords :
cluster tools; cosmic ray neutrons; integrated circuit testing; mainframes; neutron effects; semiconductor device testing; ASC Q cluster supercomputer system; AlphaServer ES45 systems; artificial neutron beam; board level cache subsystems; cluster system; cosmic ray neutron flux; neutron radiation effects; nonrandom event rates; self shielding effect; semiconductor device radiation effects; soft error discovery effect; soft error rates; Central Processing Unit; Computer errors; Error analysis; Frequency; Neutrons; Particle beams; Random access memory; Statistical analysis; System testing; Cosmic-ray-induced neutron; SRAMs; linear accelerators; memory testing; neutron beam; neutron-induced soft error; randomness testing; single-event upset; soft-error rate
fLanguage :
English
Journal_Title :
Device and Materials Reliability, IEEE Transactions on
Publisher :
IEEE
ISSN :
1530-4388
Type :
jour
DOI :
10.1109/TDMR.2005.854527
Filename :
1545894