DocumentCode :
642808
Title :
Fault detection and recovery efficiency co-optimization through compile-time analysis and runtime adaptation
Author :
Hao Chen ; Chengmo Yang
Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of Delaware, Newark, DE, USA
fYear :
2013
fDate :
Sept. 29 2013-Oct. 4 2013
Firstpage :
1
Lastpage :
10
Abstract :
The ever scaling-down feature size and noise margin keep elevating hardware failure rates, requiring the incorporation of fault tolerance into computer systems. One fault tolerance scheme that receives a lot of research attention is redundant execution. However, existing solutions are developed under the assumption that the fault rate is low. These techniques either solely focus on fault detection, or sometimes even increase recovery cost to reduce fault detection overhead. The lack of overall efficiency makes them insufficient and inappropriate for embedded systems with tight energy and cost budget. Our study shows that checkpoint frequency and fault rate are two critical parameters determining the overall fault detection and recovery overhead. To co-optimize detection and recovery, we statically construct a mathematical model, capable of taking application and architecture characteristics into consideration and identifying the optimal checkpoint frequency of an application for a given fault rate. Moreover, as the fault rate is infeasible to predict a priori, we furthermore propose a set of heuristics, enabling the system to dynamically monitor the fault rate and adapt the checkpoint frequency accordingly. The efficacy of the static and the adaptive optimizations is evaluated through detailed instructionlevel simulation. The results show that the optimal checkpoint frequency identified by the static model is very close to the actual value (6% deviation) and the run-time adaptation scheme effectively reduces the overhead caused by the unpredictability in fault rate.
Keywords :
checkpointing; fault tolerant computing; program compilers; program diagnostics; checkpoint frequency; compile-time analysis; embedded systems; fault detection efficiency co-optimization; fault detection overhead; fault rate; fault recovery efficiency co-optimization; fault recovery overhead; fault tolerance; feature size; instruction-level simulation; noise margin; redundant execution; runtime adaptation; Adaptation models; Checkpointing; Fault detection; Fault tolerant systems; Mathematical model; Registers; Runtime;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2013 International Conference on
Conference_Location :
Montreal, QC
Type :
conf
DOI :
10.1109/CASES.2013.6662528
Filename :
6662528
Link To Document :
بازگشت