Title :
Architectures for online error detection and recovery in multicore processors
Author :
Gizopoulos, Dimitris ; Psarakis, Mihalis ; Adve, Sarita V. ; Ramachandran, Pradeep ; Hari, Siva Kumar Sastry ; Sorin, Daniel ; Meixner, Albert ; Biswas, Arijit ; Vera, Xavier
Author_Institution :
Dept. of Inf., Univ. of Piraeus, Piraeus, Greece
Abstract :
The huge investment in the design and production of multicore processors may be put at risk because the emerging highly miniaturized but unreliable fabrication technologies will impose significant barriers to the life-long reliable operation of future chips. Extremely complex, massively parallel, multicore processor chips fabricated in these technologies will become more vulnerable to: (a) environmental disturbances that produce transient (or soft) errors, (b) latent manufacturing defects as well as aging/wearout phenomena that produce permanent (or hard) errors, and (c) verification inefficiencies that allow important design bugs to escape into the system. In an effort to cope with these reliability threats, several research teams have recently proposed multicore processor architectures that provide low-cost dependability guarantees against hardware errors and design bugs. This paper focuses on dependable multicore processor architectures that integrate solutions for online error detection, diagnosis, recovery, and repair during field operation. It discusses a taxonomy of representative approaches and presents a qualitative comparison based on hardware cost, performance overhead, types of faults detected, and detection latency. It also describes in more detail three recently proposed effective architectural approaches: a software-anomaly detection technique (SWAT), a dynamic verification technique (Argus), and a core salvaging methodology.
Keywords :
error detection; fault diagnosis; formal verification; multiprocessing systems; parallel architectures; Argus; aging phenomena; architectural approach; core salvaging methodology; design bugs; detection latency; dynamic verification technique; environmental disturbance; fault type detection; hardware cost; hardware errors; latent manufacturing defects; multicore processor architecture; multicore processor chip; multicore processor design; multicore processor production; online error detection; online error diagnosis; online error recovery; online error repair; parallel processor chip; performance overhead; software-anomaly detection technique; wearout phenomena; Built-in self-test; Hardware; Maintenance engineering; Multicore processing; Program processors; dependable architectures; multicore microprocessors; online error detection/recovery/repair
Conference_Title :
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2011
Conference_Location :
Grenoble
Print_ISBN :
978-1-61284-208-0
DOI :
10.1109/DATE.2011.5763096