DocumentCode :
1115060
Title :
Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing
Author :
Jafar, Samir ; Krings, Axel ; Gautier, Thierry
Author_Institution :
Dept. of Math., Univ. of Damascus, Damascus
Volume :
6
Issue :
1
Year :
2009
Firstpage :
32
Lastpage :
44
Abstract :
Large applications executing on Grid or cluster architectures consisting of hundreds or thousands of computational nodes create problems with respect to reliability. The sources of the problems are node failures and the need for dynamic configuration over extensive run-time. This paper presents two fault-tolerance mechanisms called theft-induced checkpointing and systematic event logging. These are transparent protocols capable of overcoming problems associated with both benign faults, i.e., crash faults, and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as multi-threaded applications. By allowing recovery even under different numbers of processors, the approaches are especially suitable for applications that need adaptive or reactionary configuration control. The low-cost protocols offer the capability of controlling or bounding the overhead. A formal cost model is presented, followed by an experimental evaluation. It is shown that the overhead of the protocols is very small and that the maximum work lost by a crashed process is small and bounded.
Keywords :
checkpointing; data flow graphs; fault tolerant computing; grid computing; multi-threading; protocols; adaptive configuration control; cluster architecture; dataflow graph; dynamic heterogeneous grid computing; dynamic heterogeneous system recovery; fault-tolerance mechanisms; flexible rollback recovery; formal cost model; multi threaded application; reactionary configuration control; systematic event logging; theft induced checkpointing; transparent protocol; Dataflow; Distributed architectures; Fault tolerance;
Language :
English
Journal_Title :
Dependable and Secure Computing, IEEE Transactions on
Publisher :
IEEE
ISSN :
1545-5971
Type :
jour
DOI :
10.1109/TDSC.2008.17
Filename :
4479488