• DocumentCode
    572396
  • Title
    Sampling + DMR: Practical and low-overhead permanent fault detection
  • Author
    Nomura, Shuou ; Sinclair, Matthew D. ; Ho, Chen-Han ; Govindaraju, Venkatraman ; De Kruijf, Marc ; Sankaralingam, Karthikeyan
  • Author_Institution
    Vertical Res. Group, Univ. of Wisconsin - Madison, Madison, WI, USA
  • fYear
    2011
  • fDate
    4-8 June 2011
  • Firstpage
    201
  • Lastpage
    212
  • Abstract
    With technology scaling, manufacture-time and in-field permanent faults are becoming a fundamental problem. Multi-core architectures with spares can tolerate them by detecting and isolating faulty cores, but the required fault detection coverage becomes effectively 100% as the number of permanent faults increases. Dual-modular redundancy (DMR) can provide 100% coverage without assuming device-level fault models, but its overhead is excessive. In this paper, we explore a simple and low-overhead mechanism we call Sampling-DMR: run in DMR mode for a small percentage (1% of the time for example) of each periodic execution window (5 million cycles for example). Although Sampling-DMR can leave some errors undetected, we argue the permanent fault coverage is 100% because it can detect all faults eventually. Sampling-DMR thus introduces a system paradigm of restricting all permanent faults' effects to small finite windows of error occurrence. We prove an ultimate upper bound exists on total missed errors and develop a probabilistic model to analyze the distribution of the number of undetected errors and detection latency. The model is validated using full gate-level fault injection experiments for an actual processor running full application software. Sampling-DMR outperforms conventional techniques in terms of fault coverage, sustains similar detection latency guarantees, and limits energy and performance overheads to less than 2%.
  • Keywords
    fault diagnosis; multiprocessing systems; parallel architectures; probability; Sampling-DMR; device-level fault models; dual-modular redundancy; error occurrence windows; full application software; full gate-level fault injection experiments; in-field permanent faults; low-overhead permanent fault detection; manufacture-time; multicore architectures; probabilistic model; technology scaling; Analytical models; Circuit faults; Fault detection; Mathematical model; Reliability; Upper bound; Vectors; Dual-modular redundancy; Fault tolerance; Permanent Fault; Reliability; Sampling;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Architecture (ISCA), 2011 38th Annual International Symposium on
  • Conference_Location
    San Jose, CA
  • ISSN
    1063-6897
  • Print_ISBN
    978-1-4503-0472-6
  • Type
    conf
  • Filename
    6307759