• DocumentCode
    3055578
  • Title
    Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance
  • Author
    Shye, Alex ; Moseley, Tipp ; Reddi, Vijay Janapa ; Blomstedt, Joseph ; Connors, Daniel A.
  • Author_Institution
    U. of Colorado, Boulder
  • fYear
    2007
  • fDate
    25-28 June 2007
  • Firstpage
    297
  • Lastpage
    306
  • Abstract
    Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point towards multi-threaded multi-core designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper proposes a software-based multi-core alternative for transient fault tolerance using process-level redundancy (PLR). PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR's software-centric approach to transient fault tolerance shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, PLR ignores many benign faults that do not propagate to affect program correctness. A real PLR prototype for running single-threaded applications is presented and evaluated for fault coverage and performance. On a 4-way SMP machine, PLR provides improved performance over existing software transient fault tolerance techniques with 16.9% overhead for fault detection on a set of optimized SPEC2000 binaries.
  • Keywords
    multi-threading; redundancy; software fault tolerance; general-purpose microprocessors; multi-threaded multi-core designs; parallel hardware resources; process-level redundancy; transient fault tolerance; Application software; Fault detection; Fault tolerance; Hardware; Microprocessors; Operating systems; Prototypes; Redundancy; Software performance; Software prototyping;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks, 2007. DSN '07. 37th Annual IEEE/IFIP International Conference on
  • Conference_Location
    Edinburgh
  • Print_ISBN
    0-7695-2855-4
  • Type
    conf
  • DOI
    10.1109/DSN.2007.98
  • Filename
    4272981