DocumentCode
2028249
Title
Identifying patterns towards Algorithm Based Fault Tolerance
Author
Kabir, Upama ; Goswami, Dhrubajyoti
Author_Institution
Dept. of Comput. Sci. & Software Eng., Concordia Univ., Montreal, QC, Canada
fYear
2015
fDate
20-24 July 2015
Firstpage
508
Lastpage
516
Abstract
The checkpoint and recovery cost imposed by coordinated checkpoint/restart (CCP/R) is a crucial performance issue for high performance computing (HPC) applications. In comparison, Algorithm Based Fault Tolerance (ABFT) is a promising fault tolerance method with low recovery overhead, but it is not universally applicable and is not transparent to the user. In this paper we address the overhead problem of CCP/R and some of the limitations of ABFT, and propose a solution for ABFT based on algorithmic patterns. The proposed solution is a generic fault tolerance strategy for a group of applications that exhibit similar algorithmic (structural and behavioral) features. These features, together with the minimal fault recovery data (critical data), determine the fault tolerance strategy for the group of applications. We call this strategy a fault tolerance pattern (FTP). We demonstrate the idea of FTP with parallel iterative deepening A* (PIDA*) search, a generic search algorithm used to solve a wide range of discrete optimization problems (DOP). Theoretical analysis shows that our proposed solution performs better than CCP/R in terms of checkpoint and recovery time overhead. Furthermore, using FTP supports separation of concerns, which facilitates user transparency.
Keywords
checkpointing; fault tolerant computing; optimisation; parallel algorithms; search problems; ABFT; CCP/R; DOP; FTP; PIDA* search; algorithm based fault tolerance; coordinated checkpoint/restart; discrete optimization problems; fault tolerance pattern; generic search algorithm; parallel iterative deepening A* search; pattern identification; Algorithm design and analysis; Fault tolerance; Fault tolerant systems; Kernel; Program processors; Protocols; Search problems; fault tolerant parallel programs; framework for fault tolerance; parallel algorithmic patterns; patterns for fault tolerance
fLanguage
English
Publisher
IEEE
Conference_Titel
High Performance Computing & Simulation (HPCS), 2015 International Conference on
Conference_Location
Amsterdam
Print_ISBN
978-1-4673-7812-3
Type
conf
DOI
10.1109/HPCSim.2015.7237083
Filename
7237083