• DocumentCode
    177312
  • Title

    GangES: Gang error simulation for hardware resiliency evaluation

  • Author

    Hari, Siva Kumar Sastry ; Venkatagiri, Radha ; Adve, Sarita V. ; Naeimi, Helia

  • Author_Institution
    NVIDIA, USA
  • fYear
    2014
  • fDate
    14-18 June 2014
  • Firstpage
    61
  • Lastpage
    72
  • Abstract
    As technology scales, the hardware reliability challenge affects a broad computing market, rendering traditional redundancy based solutions too expensive. Software anomaly based hardware error detection has emerged as a low cost reliability solution, but suffers from Silent Data Corruptions (SDCs). It is crucial to accurately evaluate SDC rates and identify SDC producing software locations to develop software-centric low-cost hardware resiliency solutions. A recent tool, called Relyzer, systematically analyzes an entire application's resiliency to single bit soft-errors using a small set of carefully selected error injection sites. Relyzer provides a practical resiliency evaluation mechanism but still requires significant evaluation time, most of which is spent on error simulations. This paper presents a new technique called GangES (Gang Error Simulator) that aims to reduce error simulation time. GangES observes that a set or gang of error simulations that result in the same intermediate execution state (after their error injections) will produce the same error outcome; therefore, only one simulation of the gang needs to be completed, resulting in significant overall savings in error simulation time. GangES leverages program structure to carefully select when to compare simulations and what state to compare. For our workloads, GangES saves 57% of the total error simulation time with an overhead of just 1.6%. This paper also explores pure program analyses based techniques that could obviate the need for tools such as GangES altogether. The availability of Relyzer+GangES allows us to perform a detailed evaluation of such techniques. We evaluate the accuracy of several previously proposed program metrics. We find that the metrics we considered and their various linear combinations are unable to adequately predict an instruction's vulnerability to SDCs, further motivating the use of Relyzer+GangES style techniques as valuable solutions for the hardware error resiliency evaluation problem.
  • Keywords
    error detection; fault tolerant computing; program diagnostics; redundancy; software reliability; GangES; Relyzer; SDC producing software location; SDC rates; computing market; error injection site; error simulation time; gang error simulation; hardware error resiliency evaluation problem; hardware reliability challenge; hardware resiliency evaluation; program metrics; redundancy based solution; reliability solution; resiliency evaluation mechanism; silent data corruptions; soft-error; software anomaly based hardware error detection; software-centric low-cost hardware resiliency solution; Accuracy; Analytical models; Error analysis; Hardware; Registers; Software; Transient analysis;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on
  • Conference_Location
    Minneapolis, MN
  • Print_ISBN
    978-1-4799-4396-8
  • Type

    conf

  • DOI
    10.1109/ISCA.2014.6853212
  • Filename
    6853212