DocumentCode :
3722843
Title :
Differentiated Failure Remediation with Action Selection for Resilient Computing
Author :
Song Huang;Song Fu;Nathan DeBardeleben;Qiang Guan;Cheng-Zhong Xu
Author_Institution :
Dept. of Comput. Sci. &
fYear :
2015
Firstpage :
199
Lastpage :
208
Abstract :
As fault frequency increases with the component count in modern and future computer systems, resilience becomes increasingly critical. Existing work on anomaly detection and fault prediction enables failure avoidance techniques to circumvent fault effects proactively. In addition, traditional fault tolerance techniques can be applied to handle faults reactively. Different types of faults may affect different components of a system and manifest in different ways, so they need to be treated differently. However, existing fault handling techniques treat all faults uniformly, without considering their types and distinct properties. In this paper, we present a differentiated fault remediation framework with action selection (DFRAS), which integrates preventive and reactive remediation actions differentiated by fault type and urgency requirements. We investigate four major types of faults and identify candidate remediation actions. We apply the urgency requirements as constraints for action selection. We propose formal performance models to quantify the wasted time of the candidate actions, and develop a decision-making method to select the actions that minimize the overall remediation cost. We have implemented a prototype of DFRAS and evaluated its performance by simulations and experiments. Simulation and experimental results show that the integrated fault remediation strategies can significantly reduce the remediation overhead. The developed DFRAS system is lightweight, making it feasible for online fault management in large-scale systems.
Keywords :
"Checkpointing","Software","Fault tolerance","Fault tolerant systems","Hardware","Fault diagnosis","Runtime"
Publisher :
IEEE
Conference_Title :
Dependable Computing (PRDC), 2015 IEEE 21st Pacific Rim International Symposium on
Type :
conf
DOI :
10.1109/PRDC.2015.42
Filename :
7371863