Title :
Error-Resilient Design Techniques for Reliable and Dependable Computing
Author :
Das, Shidhartha ; Bull, David M. ; Whatmough, Paul N.
Author_Institution :
ARM Ltd., Cambridge, UK
Abstract :
Integrated circuits in modern systems-on-chip and microprocessors are typically operated with sufficient timing margins to mitigate the impact of rising process, voltage, and temperature (PVT) variations at advanced process nodes. The widening margins required for ensuring robust computation inevitably lead to conservative designs with unacceptable energy-efficiency overheads. Reconciling the conflicting objectives imposed by variation mitigation and energy-efficient computing will require fundamental departures from conventional circuit and system design practices. This paper posits error-resilient general-purpose computing as an effective approach for achieving this. We review resilient techniques that exploit tolerance to timing errors to automatically compensate for variations and dynamically tune a system to its most efficient operating point. We present the Razor approach as a pioneering example of such a technique. We present silicon measurement results from multiple industrial and academic demonstration systems that employ Razor dynamic voltage and frequency management. In particular, we highlight the application of Razor to two specific platforms. The first is an ARM-based industrial prototype where Razor dynamic adaptation leads to 52% energy savings at 1 GHz operation. The second platform applies Razor for robust operation in the presence of radiation-induced Single Event Upsets. These efforts clearly demonstrate how energy-efficient compute engines can be designed by combining timing-error resiliency with optimizations across algorithms, circuits, and microarchitecture boundaries.
Keywords :
elemental semiconductors; integrated circuit design; integrated circuit reliability; microprocessor chips; silicon; system-on-chip; ARM-based industrial prototype; Razor dynamic voltage; Si; error-resilient design; frequency 1 GHz; frequency management; integrated circuits; microprocessors; radiation-induced single event upsets; systems-on-chip; timing-error resiliency; Energy efficiency; Flip-flops; Inverters; Latches; Pipelines; Reliability; Timing; Energy-efficient Digital Design; Error-resilient Computing; Variation Mitigation
Journal_Title :
IEEE Transactions on Device and Materials Reliability
DOI :
10.1109/TDMR.2015.2389038