DocumentCode :
3145965
Title :
Building a Fault Tolerant MPI Application: A Ring Communication Example
Author :
Hursey, Joshua ; Graham, Richard L.
Author_Institution :
Oak Ridge Nat. Lab., Oak Ridge, TN, USA
fYear :
2011
fDate :
16-20 May 2011
Firstpage :
1549
Lastpage :
1556
Abstract :
Process failure is projected to become a normal event for many long-running and scalable High Performance Computing (HPC) applications. As such, many application developers are investigating Algorithm Based Fault Tolerance (ABFT) techniques to improve the efficiency of application recovery beyond what existing checkpoint/restart techniques alone can provide. Unfortunately for these application developers, the libraries that their applications depend upon, like the Message Passing Interface (MPI), do not have standardized fault tolerance semantics. This paper introduces the reader to a set of run-through stabilization semantics being developed by the MPI Forum's Fault Tolerance Working Group to support ABFT. Using a well-known ring communication program as the running example, this paper illustrates to application developers new to ABFT some of the issues that arise when designing a fault tolerant application. The ring program allows the paper to focus on the communication-level issues rather than the data preservation mechanisms covered by existing literature. This paper highlights a common set of issues that application developers must address in their design, including program control management, duplicate message detection, termination detection, and testing. The discussion provides application developers new to ABFT with an introduction to both the new interfaces becoming available and a range of design issues that they will likely need to address regardless of their research domain.
Keywords :
application program interfaces; fault tolerant computing; message passing; program control structures; program verification; security of data; telecommunication network topology; ABFT; MPI forum fault tolerance working group; application developer; checkpoint-restart technique; data preservation mechanism; fault tolerance semantics; fault tolerance technique; fault tolerant MPI application; fault tolerant application; high performance computing application; message detection; message passing interface; program control management; ring communication program; termination detection; Context; Fault tolerance; Fault tolerant systems; Libraries; Proposals; Semantics;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on
Conference_Location :
Shanghai
ISSN :
1530-2075
Print_ISBN :
978-1-61284-425-1
Electronic_ISBN :
1530-2075
Type :
conf
DOI :
10.1109/IPDPS.2011.308
Filename :
6009014