DocumentCode :
2923634
Title :
Low-cost program-level detectors for reducing silent data corruptions
Author :
Hari, Siva Kumar Sastry ; Adve, Sarita V. ; Naeimi, Helia
Author_Institution :
Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
fYear :
2012
fDate :
25-28 June 2012
Firstpage :
1
Lastpage :
12
Abstract :
With technology scaling, transient faults are becoming an increasing threat to hardware reliability. Commodity systems must be made resilient to these in-field faults through very low-cost resiliency solutions. Software-level symptom detection techniques have emerged as promising low-cost and effective solutions. While the current user-visible Silent Data Corruption (SDC) rates for these techniques are relatively low, eliminating or significantly lowering the SDC rate is crucial for these solutions to become practically successful. Identifying and understanding program sections that cause SDCs is crucial to reducing (or eliminating) SDCs in a cost-effective manner. This paper provides a detailed analysis of code sections that produce over 90% of SDCs for six applications we studied. This analysis facilitated the development of program-level detectors that catch errors in quantities that are either accumulated or active for a long duration, amortizing the detection costs. These low-cost detectors significantly reduce the dependency on redundancy-based techniques and provide more practical and flexible choice points on the performance vs. reliability trade-off curve. For example, for an average of 90%, 99%, or 100% reduction of the baseline SDC rate, the average execution overheads of our approach versus redundancy alone are respectively 12% vs. 30%, 19% vs. 43%, and 27% vs. 51%.
Keywords :
fault diagnosis; program testing; redundancy; software reliability; average execution overheads; baseline SDC rate reduction; code sections; commodity systems; hardware reliability; low-cost program-level detectors; redundancy-based techniques; reliability; software-level symptom detection techniques; technology scaling; transient faults; user-visible silent data corruption rates; Arrays; Circuit faults; Detectors; Hardware; Redundancy; Registers; Transient analysis; Application resiliency; Hardware reliability; Silent data corruptions; Symptom-based fault detection; Transient faults;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Dependable Systems and Networks (DSN), 2012 42nd Annual IEEE/IFIP International Conference on
Conference_Location :
Boston, MA
ISSN :
1530-0889
Print_ISBN :
978-1-4673-1624-8
Electronic_ISBN :
1530-0889
Type :
conf
DOI :
10.1109/DSN.2012.6263960
Filename :
6263960