Title :
Reliability-Aware Exceptions: Tolerating intermittent faults in microprocessor array structures
Author :
Dweik, Waleed ; Annavaram, Murali ; Dubois, Michel
Author_Institution :
Ming Hsieh Dept. of Electr. Eng., Univ. of Southern California, Los Angeles, CA, USA
Abstract :
In future technology nodes, reliability is expected to become a first-order design constraint. Faults encountered in a chip can be classified into three categories: transient, intermittent, and permanent. Fault classification allows a chip to take the appropriate corrective action. Mechanisms have been proposed that distinguish transient from non-transient faults, but they handle all non-transient faults as permanent. Intermittent faults induced by wearout phenomena have become the dominant reliability concern in nanoscale technology, yet no mechanism classifies non-transient faults more finely into intermittent and permanent faults. In this paper, we present a new class of exceptions called Reliability-Aware Exceptions (RAEs), which make it possible to identify intermittent faults in microprocessor array structures. RAE handlers can manipulate microprocessor array structures to recover from all three categories of faults. Using RAEs, we demonstrate that the reliability of two representative microarchitecture structures in an out-of-order processor, the load/store queue and the reorder buffer, is improved by average factors of 1.3 and 1.95, respectively.
Keywords :
fault diagnosis; fault tolerance; integrated circuit reliability; microprocessor chips; RAEs; fault classification; first-order design constraint; intermittent fault tolerance; load-store queue; microprocessor array structures; nanoscale technology; nontransient faults; out-of-order processor; permanent faults; reliability-aware exceptions; reorder buffer; representative microarchitecture structures; wearout phenomena; Arrays; Circuit faults; Computational modeling; Microprocessors; Radiation detectors; Reliability; Transient analysis; array structure; de-configuration; fault injection; intermittent fault;
Conference_Titel :
Design, Automation and Test in Europe Conference and Exhibition (DATE), 2014
Conference_Location :
Dresden
DOI :
10.7873/DATE.2014.114