DocumentCode
2322045
Title
Enabling Application Resilience with and without the MPI Standard
Author
Bland, Wesley
Author_Institution
Innovative Comput. Lab., Univ. of Tennessee, Knoxville, TN, USA
fYear
2012
fDate
13-16 May 2012
Firstpage
746
Lastpage
751
Abstract
As recent research has demonstrated, it is becoming a necessity for large-scale applications to have the ability to tolerate process failure during an execution. As the number of processes increases, checkpoint/restart fault tolerance approaches requiring large concurrent state checkpointing become untenable and radically new methods to address fault tolerance are needed. This work addresses these challenges by proposing a novel approach to a minimalistic fault discovery and management model. Such a model allows applications to run to completion despite fail-stop failures. As a proof of concept, in addition to the proposed fault tolerance model, an implementation in the context of the Open MPI library is provided, evaluated and analyzed.
Keywords
application program interfaces; checkpointing; fault tolerant computing; message passing; MPI standard; Open MPI library; application resilience; concurrent state checkpointing; fail-stop failures; fault tolerance approach; message passing interface; minimalistic fault discovery; proof of concept; runtime process failure; Fault tolerance; Fault tolerant systems; Libraries; Routing; Runtime; Standards; Topology; Distributed Runtime; Fault Tolerance; Message Passing Interface;
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium on
Conference_Location
Ottawa, ON
Print_ISBN
978-1-4673-1395-7
Type
conf
DOI
10.1109/CCGrid.2012.25
Filename
6217505