• DocumentCode
    2028249
  • Title
    Identifying patterns towards Algorithm Based Fault Tolerance
  • Author
    Kabir, Upama ; Goswami, Dhrubajyoti
  • Author_Institution
    Dept. of Comput. Sci. & Software Eng., Concordia Univ., Montreal, QC, Canada
  • fYear
    2015
  • fDate
    20-24 July 2015
  • Firstpage
    508
  • Lastpage
    516
  • Abstract
    The checkpoint and recovery cost imposed by coordinated checkpoint/restart (CCP/R) is a crucial performance issue for high performance computing (HPC) applications. In comparison, Algorithm Based Fault Tolerance (ABFT) is a promising fault tolerance method with low recovery overhead, but it lacks universal applicability and is not transparent to the user. In this paper we address the overhead problem of CCP/R and some of the limitations of ABFT, and propose a solution for ABFT based on algorithmic patterns. The proposed solution is a generic fault tolerance strategy for a group of applications that exhibit similar algorithmic (structural and behavioral) features. These features, together with the minimal fault recovery data (critical data), determine the fault tolerance strategy for the group of applications. We call this strategy a fault tolerance pattern (FTP). We demonstrate the idea of FTP with parallel iterative deepening A* (PIDA*) search, a generic search algorithm used to solve a wide range of discrete optimization problems (DOP). Theoretical analysis shows that our proposed solution performs better than CCP/R in terms of checkpoint and recovery time overhead. Furthermore, using FTP helps separate concerns, which facilitates user transparency.
  • Keywords
    checkpointing; fault tolerant computing; optimisation; parallel algorithms; search problems; ABFT; CCP/R; DOP; FTP; PIDA* search; algorithm based fault tolerance; coordinated checkpoint/restart; discrete optimization problems; fault tolerance pattern; generic search algorithm; parallel iterative deepening A* search; pattern identification; Algorithm design and analysis; Fault tolerance; Fault tolerant systems; Kernel; Program processors; Protocols; Search problems; fault tolerant parallel programs; framework for fault tolerance; parallel algorithmic patterns; patterns for fault tolerance
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing & Simulation (HPCS), 2015 International Conference on
  • Conference_Location
    Amsterdam
  • Print_ISBN
    978-1-4673-7812-3
  • Type
    conf
  • DOI
    10.1109/HPCSim.2015.7237083
  • Filename
    7237083