DocumentCode
3316455
Title
Design, implementation and performance of fault-tolerant message passing interface (MPI)
Author
Pitchiah, R.
fYear
2004
fDate
20-22 July 2004
Firstpage
120
Lastpage
129
Abstract
Fault-tolerant MPI (FTMPI) adds fault tolerance to MPICH, an open source, GPL-licensed implementation of the MPI standard from Argonne National Laboratory's Mathematics and Computer Science Division. FTMPI is a transparent fault-tolerant environment based on a synchronous checkpointing and restart mechanism. FTMPI relies on a non-multithreaded, single-process checkpointing library to synchronously checkpoint an application process. A global replicated system controller and a cluster-node-specific node controller monitor and control the checkpointing and recovery activities of all MPI applications within the cluster. This work details the architecture that provides a fault tolerance mechanism for MPI-based applications running on clusters, and reports the performance of the NAS parallel benchmarks and of the parallelized medium-range weather forecasting models P-T80 and P-T126. The architecture also addresses the following issues: replicating the system controller to avoid a single point of failure, ensuring consistency of checkpoint files with a distributed two-phase commit protocol, and providing a robust fault detection hierarchy.
Keywords
checkpointing; fault tolerant computing; message passing; network operating systems; public domain software; workstation clusters; NAS parallel benchmarks; P-T80; P-T126; asynchronous checkpointing; cluster computing; distributed two phase commit protocol; fault detection; fault-tolerant message passing interface; global replicated system controller; node controller; nonmultithreaded single process checkpointing library; open source GPL; replicating system controller; restarting mechanism; synchronous checkpointing; task migration; weather forecasting models; Application software; Checkpointing; Computer displays; Computer science; Control systems; Fault tolerance; Laboratories; Libraries; Mathematics; Message passing
fLanguage
English
Publisher
ieee
Conference_Titel
High Performance Computing and Grid in Asia Pacific Region, 2004. Proceedings. Seventh International Conference on
Print_ISBN
0-7695-2138-X
Type
conf
DOI
10.1109/HPCASIA.2004.1324026
Filename
1324026