Title :
Checkpointing SPMD applications on transputer networks
Author :
Silva, Luis Moura ; Veer, Bart ; Silva, Joao Gabriel
Author_Institution :
Coimbra Univ., Portugal
Abstract :
Providing fault-tolerance for parallel/distributed applications is a problem of paramount importance, since the overall failure rate of the system increases with the number of processors, and the failure of just one processor can lead to the complete crash of the program. Checkpointing mechanisms are a good candidate for ensuring the continuity of applications when failures occur. In this paper, we present an experimental study of several variations of checkpointing for SPMD (single program, multiple data) applications. We used a typical benchmark to experimentally assess the overhead, advantages and limitations of each checkpointing scheme.
Keywords :
fault tolerant computing; parallel processing; performance evaluation; system recovery; transputer systems; SPMD applications; application continuity; benchmark; checkpointing scheme; distributed applications; failure rate; fault-tolerance; overhead assessment; parallel applications; program crash; transputer networks; Application software; Checkpointing; Concurrent computing; Electronic mail; Fault tolerance; Libraries; Master-slave; Parallel processing; Parallel programming; Programming profession;
Conference_Titel :
Scalable High-Performance Computing Conference, 1994., Proceedings of the
Conference_Location :
Knoxville, TN
Print_ISBN :
0-8186-5680-8
DOI :
10.1109/SHPCC.1994.296709