• DocumentCode
    980510
  • Title
    PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures
  • Author
    Shye, Alex; Blomstedt, Joseph; Moseley, Tipp; Reddi, Vijay Janapa; Connors, Daniel A.
  • Author_Institution
    Technol. Inst., Northwestern Univ., Evanston, IL
  • Volume
    6
  • Issue
    2
  • fYear
    2009
  • Firstpage
    135
  • Lastpage
    148
  • Abstract
    Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point toward multicore designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper presents process-level redundancy (PLR), a software technique for transient fault tolerance, which leverages multiple cores for low overhead. PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR uses a software-centric approach to transient fault tolerance, which shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, many benign faults that do not propagate to affect program correctness can be safely ignored. A real prototype is presented that is designed to be transparent to the application and can run on general-purpose single-threaded programs without modifications to the program, operating system, or underlying hardware. The system is evaluated for fault coverage and performance on a four-way SMP machine and provides improved performance over existing software transient fault tolerance techniques with a 16.9 percent overhead for fault detection on a set of optimized SPEC2000 binaries.
  • Keywords
    fault tolerant computing; multiprocessing systems; parallel architectures; SMP machine; general-purpose microprocessor; general-purpose single-threaded program; hardware execution; multicore architecture; multicore design; parallel hardware resources; process-level redundancy; reliability; software approach; software-centric approach; software execution; transient fault tolerance; redundant design; soft errors; transient faults
  • fLanguage
    English
  • Journal_Title
    Dependable and Secure Computing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5971
  • Type
    jour
  • DOI
    10.1109/TDSC.2008.62
  • Filename
    4668353
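
The abstract describes creating a set of redundant processes per application process and comparing them to detect faults. The following is a minimal sketch of that general idea, not the authors' prototype: it launches several copies of a command as separate operating system processes and compares their exit status and standard output by majority vote. The names `vote` and `run_redundant`, the default of three replicas, and the choice to compare only final output are all assumptions made for illustration; the abstract does not state where or how the real system performs its comparisons.

```python
import subprocess
import sys
from collections import Counter


def vote(outputs):
    """Return (majority_output, indices_of_disagreeing_replicas).

    Raises RuntimeError when no strict majority exists, i.e. a
    mismatch was detected but cannot be resolved by voting.
    """
    counts = Counter(outputs)
    winner, n = counts.most_common(1)[0]
    if n * 2 <= len(outputs):
        raise RuntimeError("no majority among redundant processes")
    faulty = [i for i, out in enumerate(outputs) if out != winner]
    return winner, faulty


def run_redundant(cmd, replicas=3):
    """Run `cmd` as `replicas` independent processes and vote on results.

    Hypothetical helper: the OS is free to schedule each replica on
    any available core, since they are ordinary processes.
    """
    procs = [subprocess.Popen(cmd, stdout=subprocess.PIPE)
             for _ in range(replicas)]
    outputs = []
    for p in procs:
        out, _ = p.communicate()          # wait and collect stdout
        outputs.append((p.returncode, out))
    return vote(outputs)


if __name__ == "__main__":
    # Example: three replicas of a trivial Python command.
    result, faulty = run_redundant([sys.executable, "-c", "print(6 * 7)"])
    print("agreed result:", result, "disagreeing replicas:", faulty)
```

With two replicas, a disagreement can only be detected (the vote raises an error); with three or more, a single disagreeing replica can be identified and outvoted. Because only the externally visible result is compared, a fault that never changes the output is ignored, which loosely mirrors the software-centric view in the abstract.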