• DocumentCode
    1837630
  • Title

    Transient Fault Recovery on Chip Multiprocessor based on Dual Core Redundancy and Context Saving

  • Author

    Gong, Rui ; Dai, Kui ; Wang, Zhiying

  • Author_Institution
    Sch. of Comput., Nat. Univ. of Defense Technol., Changsha
  • fYear
    2008
  • fDate
    18-21 Nov. 2008
  • Firstpage
    148
  • Lastpage
    153
  • Abstract
    To address the increasing susceptibility of microprocessors to transient faults, many techniques have been proposed to exploit the core redundancy of chip multiprocessors (CMPs). Chip-level redundant threading (CRT) is a novel approach that detects transient faults on CMPs by executing two copies of a given program on separate cores and comparing the store data. CRTR (CRT with recovery) achieves fault recovery by comparing the result of every instruction before commit. Once a nonidentical result is detected, the microprocessor can be recovered by re-executing from the instruction that produced the mismatch. Inter-core communication therefore becomes critical in CRTR. To reduce the inter-core communication bandwidth demand, this paper proposes a new approach to fault recovery, dual core redundancy with context saving (DCR-C). DCR-C extends CRT by adding hardware-implemented context saving and recovery. In DCR-C, only store instructions are compared before commit, as in CRT, so the bandwidth demand is largely reduced. Context saving is triggered by a cache miss caused by a store, so the context-saving latency can be efficiently hidden. Once a fault is detected, the processor is restored to the saved context. The experimental results demonstrate that DCR-C achieves fault recovery with low performance overhead and low inter-core bandwidth demand.
  • Keywords
    fault tolerant computing; microprocessor chips; multi-threading; redundancy; chip multiprocessor; chip-level redundant threading; context saving; dual core redundancy; intercore communication; transient fault recovery; Bandwidth; Cathode ray tubes; Circuit faults; Communication channels; Context; Delay; Fault detection; Microprocessors; Redundancy; Voltage
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for
  • Conference_Location
    Hunan
  • Print_ISBN
    978-0-7695-3398-8
  • Electronic_ISBN
    978-0-7695-3398-8
  • Type

    conf

  • DOI
    10.1109/ICYCS.2008.271
  • Filename
    4708964