Title :
Fault-tolerant solutions for a MPI compute intensive application
Author :
Mourino, J.C. ; Martin, Maria J. ; Gonzalez, P. ; Doallo, Ramon
Author_Institution :
CESGA (Supercomputing Center of Galicia), Santiago de Compostela
Abstract :
The running times of large-scale computational science and engineering parallel applications, executed on clusters or grid platforms, are usually longer than the mean time between failures (MTBF). Parallel applications must therefore tolerate hardware failures so that not all of the computation already done is lost when a machine fails. Checkpointing and rollback recovery is a very useful technique for implementing fault-tolerant applications. Although extensive research has been carried out in this field, few tools are available to help parallel programmers add fault-tolerance capabilities to their applications. This work presents two different approaches to adding fault tolerance to the MPI version of an air quality simulation. A segment-level solution has been implemented by extending a checkpointing library for sequential codes. A variable-level solution has been implemented manually in the code. The main differences between the two approaches are portability, transparency level, and checkpointing overhead. Experimental results comparing both strategies on a cluster of PCs are shown in the paper.
Keywords :
application program interfaces; checkpointing; fault tolerant computing; message passing; MPI compute intensive application; air quality simulation; checkpointing library; fault tolerance; fault-tolerant application; fault-tolerant solution; message passing interface; rollback recovery; segment-level solution; sequential codes; variable-level solution; Checkpointing; Computer applications; Concurrent computing; Fault tolerance; Grid computing; Hardware; Large-scale systems; Libraries; Personal communication networks; Programming profession;
Conference_Title :
Parallel, Distributed and Network-Based Processing, 2007. PDP '07. 15th EUROMICRO International Conference on
Conference_Location :
Napoli
Print_ISBN :
0-7695-2784-1
DOI :
10.1109/PDP.2007.44