• DocumentCode
    3722843
  • Title
    Differentiated Failure Remediation with Action Selection for Resilient Computing
  • Author
    Song Huang;Song Fu;Nathan DeBardeleben;Qiang Guan;Cheng-Zhong Xu
  • Author_Institution
    Dept. of Comput. Sci. &
  • fYear
    2015
  • Firstpage
    199
  • Lastpage
    208
  • Abstract
    As fault frequency rises with component count in current and future computer systems, resilience becomes increasingly critical. Existing work on anomaly detection and fault prediction enables failure avoidance techniques that circumvent fault effects proactively, while traditional fault tolerance techniques handle faults reactively. Different types of faults may affect different system components and manifest in different ways, so they need to be treated differently. However, existing fault handling techniques treat all faults uniformly, without considering their types and distinct properties. In this paper, we present a differentiated fault remediation framework with action selection (DFRAS), which integrates preventive and reactive remediation actions differentiated by fault type and urgency requirement. We investigate four major types of faults and identify candidate remediation actions. We apply the urgency requirements as constraints on action selection. We propose formal performance models to quantify the time wasted by each candidate action, and develop a decision-making method that selects the actions minimizing the overall remediation cost. We have implemented a prototype of DFRAS and evaluated its performance through simulations and experiments. The results show that the integrated fault remediation strategies can significantly reduce remediation overhead. The DFRAS system is lightweight, making it feasible for online fault management in large-scale systems.
  • Keywords
    "Checkpointing","Software","Fault tolerance","Fault tolerant systems","Hardware","Fault diagnosis","Runtime"
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Computing (PRDC), 2015 IEEE 21st Pacific Rim International Symposium on
  • Type
    conf
  • DOI
    10.1109/PRDC.2015.42
  • Filename
    7371863