DocumentCode :
177312
Title :
GangES: Gang error simulation for hardware resiliency evaluation
Author :
Hari, Siva Kumar Sastry ; Venkatagiri, Radha ; Adve, Sarita V. ; Naeimi, Helia
Author_Institution :
NVIDIA, USA
fYear :
2014
fDate :
14-18 June 2014
Firstpage :
61
Lastpage :
72
Abstract :
As technology scales, the hardware reliability challenge affects a broad computing market, rendering traditional redundancy-based solutions too expensive. Software-anomaly-based hardware error detection has emerged as a low-cost reliability solution, but suffers from Silent Data Corruptions (SDCs). It is crucial to accurately evaluate SDC rates and identify SDC-producing software locations to develop software-centric low-cost hardware resiliency solutions. A recent tool, called Relyzer, systematically analyzes an entire application's resiliency to single-bit soft errors using a small set of carefully selected error injection sites. Relyzer provides a practical resiliency evaluation mechanism but still requires significant evaluation time, most of which is spent on error simulations. This paper presents a new technique called GangES (Gang Error Simulator) that aims to reduce error simulation time. GangES observes that a set or gang of error simulations that result in the same intermediate execution state (after their error injections) will produce the same error outcome; therefore, only one simulation of the gang needs to be completed, resulting in significant overall savings in error simulation time. GangES leverages program structure to carefully select when to compare simulations and what state to compare. For our workloads, GangES saves 57% of the total error simulation time with an overhead of just 1.6%. This paper also explores pure program-analysis-based techniques that could obviate the need for tools such as GangES altogether. The availability of Relyzer+GangES allows us to perform a detailed evaluation of such techniques. We evaluate the accuracy of several previously proposed program metrics. We find that the metrics we considered and their various linear combinations are unable to adequately predict an instruction's vulnerability to SDCs, further motivating the use of Relyzer+GangES style techniques as valuable solutions for the hardware error resiliency evaluation problem.
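The gang idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "program", the choice of comparison point, and every name below (GOLDEN_INPUT, checkpoint_state, finish_from_state, simulate_gangs) are hypothetical and chosen only for illustration. Each single-bit injection is run up to a comparison point, injections that reach the same intermediate state are grouped into one gang, and only one simulation per gang is run to completion.

```python
from collections import defaultdict

# Hypothetical toy "program": the input is shifted right by 4 (the
# intermediate state at the comparison point), then multiplied by 3.
GOLDEN_INPUT = 0b10100000
GOLDEN_FINAL = (GOLDEN_INPUT >> 4) * 3


def checkpoint_state(bit):
    """Inject a single-bit flip and run up to the comparison point."""
    corrupted = GOLDEN_INPUT ^ (1 << bit)
    return corrupted >> 4


def finish_from_state(state):
    """Run the rest of the toy program and classify the outcome."""
    final = state * 3
    return "masked" if final == GOLDEN_FINAL else "sdc"


def simulate_gangs(bits):
    """Group injections by intermediate state; finish one run per gang."""
    gangs = defaultdict(list)
    for bit in bits:
        gangs[checkpoint_state(bit)].append(bit)

    outcomes = {}
    completed = 0
    for state, members in gangs.items():
        outcome = finish_from_state(state)  # one full simulation per gang
        completed += 1
        for bit in members:
            outcomes[bit] = outcome  # the rest of the gang reuses the outcome
    return outcomes, completed


outcomes, completed = simulate_gangs(range(8))
print(outcomes, completed)
```

In this toy, flips in the four low-order bits are erased by the shift and so reach the same intermediate state, which means eight injections need only five completed simulations. The savings and overhead figures in the abstract come from the authors' workloads and are not reproduced by this sketch.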
Keywords :
error detection; fault tolerant computing; program diagnostics; redundancy; software reliability; GangES; Relyzer; SDC producing software location; SDC rates; computing market; error injection site; error simulation time; gang error simulation; hardware error resiliency evaluation problem; hardware reliability challenge; hardware resiliency evaluation; program metrics; redundancy based solution; reliability solution; resiliency evaluation mechanism; silent data corruptions; soft-error; software anomaly based hardware error detection; software-centric low-cost hardware resiliency solution; Accuracy; Analytical models; Error analysis; Hardware; Registers; Software; Transient analysis;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on
Conference_Location :
Minneapolis, MN
Print_ISBN :
978-1-4799-4396-8
Type :
conf
DOI :
10.1109/ISCA.2014.6853212
Filename :
6853212