Title :
Overcoming Early-Life Failure and Aging for Robust Systems
Author :
Li, Yanjing ; Kim, Young Moon ; Mintarno, Evelyn ; Gardner, Donald S. ; Mitra, Subhasish
Author_Institution :
Stanford Univ., Stanford, CA, USA
Abstract :
The prospect of system failure has increased because of device and chip-level effects in the late CMOS era. In this article, the authors present novel system-level architecture and design innovations to cope with these lifetime reliability challenges. At nanometer-scale geometries, several hardware failure mechanisms, which were largely benign in the past, are becoming visible at the system level. Moreover, recent studies indicate that, depending on the application, hardware failures can be significant contributors to overall system failure rates. Designing robust systems that ensure the required hardware reliability, although nontrivial, is achievable, but at high cost. Concurrent error detection during system operation is an extremely important aspect of such systems. Hardware reliability challenges arise from three major sources: early-life failures (also called infant mortality), radiation-induced soft errors, and circuit aging. Several techniques, such as Built-in Soft-Error Resilience (BISER), can be effectively used for correcting radiation-induced transient (soft) errors. This article focuses on early-life failures (ELF) and circuit aging. The techniques discussed exploit specific characteristics of these reliability mechanisms without incurring the high costs of traditional concurrent error detection.
Keywords :
CMOS logic circuits; error detection; microprocessor chips; software architecture; system recovery; CMOS chip-level effect; built-in soft error resilience technique; concurrent error detection; early-life failures; radiation induced soft errors; robust systems design; system level architecture; Aging; Costs; Failure analysis; Geometry; Hardware; Radiation detector circuits; Radiation detectors; Resilience; Robustness; Technological innovation; design and test; failure prediction; online self-test and diagnostics; reliability; robust-system design; self-healing;
Journal_Title :
Design & Test of Computers, IEEE
DOI :
10.1109/MDT.2009.152