DocumentCode :
3025952
Title :
Fault-tolerant parallel applications with dynamic parallel schedules
Author :
Gerlach, Sebastian ; Hersch, Roger D.
Author_Institution :
Sch. of Comput. & Commun. Sci., Ecole Polytech. Fed. de Lausanne, Switzerland
fYear :
2005
fDate :
4-8 April 2005
Abstract :
Commodity computer clusters are often composed of hundreds of computing nodes. These generally off-the-shelf systems are not designed for high reliability. Node failures therefore drive the mean time between failures (MTBF) of such clusters to unacceptable levels. The software frameworks used for running parallel applications need to be fault-tolerant in order to ensure continued execution despite node failures. We propose an extension to the flow-graph-based Dynamic Parallel Schedules (DPS) development framework that allows non-trivial parallel applications to continue executing despite node failures. The proposed fault-tolerance mechanism relies on a set of backup threads located in the volatile storage of alternate nodes. These backup threads are kept up to date by duplicating the transmitted data objects and periodically checkpointing thread states. When a node fails, the current state of the threads that were on the failed node is reconstructed on the backup threads by re-executing operations. The corresponding valid re-execution order is automatically deduced from the data flow graph of the DPS application. Multiple simultaneous failures can be tolerated, provided that for each thread either the active thread or its corresponding backup thread survives. For threads that do not store a local state, an optimized mechanism eliminates the need for duplicate data object transmissions. The overhead induced by the fault-tolerance mechanism consists mainly of duplicate data object transmissions that can, for compute-bound applications, be carried out in parallel with ongoing computations. The increase in execution time due to fault tolerance therefore remains relatively low. It depends on the communication-to-computation ratio and on the parallel program's efficiency.
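The recovery idea described in the abstract (duplicate each data object to a backup, checkpoint periodically, re-execute the logged operations after a failure) can be illustrated with a toy sketch. This is not the DPS API: DPS is a separate framework, and the class names `ActiveThread` and `BackupThread`, the integer state, and the use of addition as the "operation" are all hypothetical choices made for illustration. The sketch also replays objects in arrival order and does not model how a valid re-execution order is deduced from the flow graph.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BackupThread:
    """Hypothetical backup copy held in the memory of an alternate node."""
    checkpoint: int = 0                            # last checkpointed thread state
    log: List[int] = field(default_factory=list)   # duplicated data objects since then

    def recover(self) -> int:
        """Rebuild the lost state by re-executing the logged operations in order."""
        state = self.checkpoint
        for obj in self.log:
            state += obj  # same (assumed) operation the active thread applied
        return state


@dataclass
class ActiveThread:
    """Hypothetical active thread whose local state is a running sum."""
    backup: BackupThread
    state: int = 0

    def receive(self, obj: int) -> None:
        # Each incoming data object is also sent to the backup thread.
        self.backup.log.append(obj)
        self.state += obj

    def take_checkpoint(self) -> None:
        # Copy the current state to the backup; older log entries are no longer needed.
        self.backup.checkpoint = self.state
        self.backup.log.clear()


def run_demo() -> Tuple[int, int]:
    backup = BackupThread()
    active = ActiveThread(backup)
    for obj in (1, 2, 3):
        active.receive(obj)
    active.take_checkpoint()
    for obj in (4, 5):
        active.receive(obj)
    state_before_failure = active.state
    # Simulated node failure: the active thread is lost, only the backup survives.
    recovered_state = backup.recover()
    return state_before_failure, recovered_state
```

In this sketch the checkpoint bounds the amount of re-execution: only the objects received after the last checkpoint are replayed on the backup.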
Keywords :
data flow computing; fault tolerant computing; multi-threading; processor scheduling; system recovery; workstation clusters; computer clusters; data flow graph; duplicate data object transmission; dynamic parallel schedules; fault tolerance; message logging; parallel computing; parallel programs; workstation clusters; Application software; Checkpointing; Concurrent computing; Degradation; Dynamic scheduling; Fault tolerance; Flow graphs; Processor scheduling; Workstations; Yarn; Parallel computing; checkpointing; clusters of workstations; fault tolerance; graceful degradation; message logging; parallel schedules;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International
Print_ISBN :
0-7695-2312-9
Type :
conf
DOI :
10.1109/IPDPS.2005.226
Filename :
1420238