Title :
Enabling Application Resilience with and without the MPI Standard
Author_Institution :
Innovative Comput. Lab., Univ. of Tennessee, Knoxville, TN, USA
Abstract :
As recent research has demonstrated, it is becoming a necessity for large-scale applications to tolerate process failures during execution. As the number of processes increases, checkpoint/restart fault tolerance approaches requiring large concurrent state checkpointing become untenable, and radically new methods to address fault tolerance are needed. This work addresses these challenges by proposing a novel approach to a minimalistic fault discovery and management model. Such a model allows applications to run to completion despite fail-stop failures. As a proof of concept, in addition to the proposed fault tolerance model, an implementation in the context of the Open MPI library is provided, evaluated, and analyzed.
Keywords :
application program interfaces; checkpointing; fault tolerant computing; message passing; MPI standard; Open MPI library; application resilience; concurrent state checkpointing; fail-stop failures; fault tolerance approach; message passing interface; minimalistic fault discovery; proof of concept; runtime process failure; Fault tolerance; Fault tolerant systems; Libraries; Routing; Runtime; Standards; Topology; Distributed Runtime; Fault Tolerance; Message Passing Interface;
Conference_Title :
Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium on
Conference_Location :
Ottawa, ON
Print_ISBN :
978-1-4673-1395-7
DOI :
10.1109/CCGrid.2012.25