• DocumentCode
    3549414
  • Title
    ReStore: symptom based soft error detection in microprocessors
  • Author
    Wang, Nicholas J. ; Patel, Sanjay J.

  • Author_Institution
    Dept. of Electr. & Comput. Eng., Illinois Univ., Urbana, IL, USA
  • fYear
    2005
  • fDate
    28 June-1 July 2005
  • Firstpage
    30
  • Lastpage
    39
  • Abstract
    Device scaling and large scale integration have led to growing concerns about soft errors in microprocessors. To date, in all but the most demanding applications, implementing parity and ECC for caches and other large, regular SRAM structures has been sufficient to stem the growing soft error tide. This will not be the case for long, and questions remain as to the best way to detect and recover from soft errors in the remainder of the processor - in particular, the less structured execution core. In this work, we propose the ReStore architecture, which leverages existing performance enhancing checkpointing hardware to recover from soft error events in a low cost fashion. Error detection in the ReStore architecture is novel: symptoms that hint at the presence of soft errors trigger restoration of a previous checkpoint. Example symptoms include exceptions, control flow mis-speculations, and cache or translation look-aside buffer misses. Compared to conventional soft error detection via full replication, the ReStore framework incurs little overhead, but sacrifices some amount of error coverage. These attributes make it an ideal means to provide very cost effective error coverage for processor applications that can tolerate a nonzero, but small, soft error failure rate. Our evaluation of an example ReStore implementation exhibits a 2x increase in MTBF (mean time between failures) over a standard pipeline with minimal hardware and performance overheads. The MTBF increases by 7x if ReStore is coupled with parity protection for certain pipeline structures.
  • Keywords
    computer architecture; error detection codes; fault tolerant computing; logic testing; microprocessor chips; parity check codes; ReStore architecture; cache storage; control flow mis-speculation; exception handling; hardware checkpointing; microprocessor; parity protection; pipeline structure; soft error detection; translation look-aside buffer; Checkpointing; Costs; Error correction codes; Hardware; Large scale integration; Microprocessors; Pipelines; Protection; Random access memory; Tides;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks, 2005. DSN 2005. Proceedings. International Conference on
  • Print_ISBN
    0-7695-2282-3
  • Type
    conf
  • DOI
    10.1109/DSN.2005.82
  • Filename
    1467777