• DocumentCode
    2923634
  • Title

    Low-cost program-level detectors for reducing silent data corruptions

  • Author

    Hari, Siva Kumar Sastry ; Adve, Sarita V. ; Naeimi, Helia

  • Author_Institution
    Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
  • fYear
    2012
  • fDate
    25-28 June 2012
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    With technology scaling, transient faults are becoming an increasing threat to hardware reliability. Commodity systems must be made resilient to these in-field faults through very low-cost resiliency solutions. Software-level symptom detection techniques have emerged as promising low-cost and effective solutions. While the current user-visible Silent Data Corruption (SDC) rates for these techniques is relatively low, eliminating or significantly lowering the SDC rate is crucial for these solutions to become practically successful. Identifying and understanding program sections that cause SDCs is crucial to reducing (or eliminating) SDCs in a cost effective manner. This paper provides a detailed analysis of code sections that produce over 90% of SDCs for six applications we studied. This analysis facilitated the development of program-level detectors that catch errors in quantities that are either accumulated or active for a long duration, amortizing the detection costs. These low-cost detectors significantly reduce the dependency on redundancy-based techniques and provide more practical and flexible choice points on the performance vs. reliability trade-off curve. For example, for an average of 90%, 99%, or 100% reduction of the baseline SDC rate, the average execution overheads of our approach versus redundancy alone are respectively 12% vs. 30%, 19% vs. 43%, and 27% vs. 51%.
  • Keywords
    fault diagnosis; program testing; redundancy; software reliability; average execution overheads; baseline SDC rate reduction; code sections; commodity systems; hardware reliability; low-cost program-level detectors; redundancy-based techniques; reliability; software-level symptom detection techniques; technology scaling; transient faults; user-visible silent data corruption rates; Arrays; Circuit faults; Detectors; Hardware; Redundancy; Registers; Transient analysis; Application resiliency; Hardware reliability; Silent data corruptions; Symptom-based fault detection; Transient faults;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Dependable Systems and Networks (DSN), 2012 42nd Annual IEEE/IFIP International Conference on
  • Conference_Location
    Boston, MA
  • ISSN
    1530-0889
  • Print_ISBN
    978-1-4673-1624-8
  • Electronic_ISBN
    1530-0889
  • Type

    conf

  • DOI
    10.1109/DSN.2012.6263960
  • Filename
    6263960