Title :
An Application Level Approach for Proactive Process Migration in MPI Applications
Author :
Cores, Iván ; Rodríguez, Gabriel ; Gonzalez, P. ; Martín, María J.
Author_Institution :
Comput. Archit. Group, Univ. of A Coruna, A Coruna, Spain
Abstract :
The running times of large-scale computational science and engineering parallel applications are usually longer than the mean-time-between-failures (MTBF). Parallel applications must therefore tolerate hardware failures so that a machine failure does not discard all the computation already done. Checkpointing and rollback recovery is a very useful technique to implement fault-tolerant applications. However, when a failure occurs, most checkpointing mechanisms require a complete restart of the parallel application from the last checkpoint. This reduces the efficiency of the solution, introducing an unnecessary overhead that can be avoided by migrating only the affected process in case of failure. Although research has been carried out in this field, the solutions proposed in the literature are commonly tied to specific implementations of the parallel communication APIs or to specific runtime environments. The approach presented in this work extends an application-level checkpointing framework to proactively migrate MPI processes away from processors when impending failures are notified, without having to restart the entire application. The main features of the proposed solution are transparency for the user, achieved through the use of a compiler tool and a runtime library, and portability, since it is not locked into a particular MPI implementation.
Keywords :
application program interfaces; checkpointing; failure analysis; fault tolerance; message passing; parallel processing; program compilers; MPI application; application level approach; checkpointing recovery; compiler tool; engineering parallel application; fault-tolerant application; hardware failure; large-scale computational science; machine failure; mean-time-between-failure; parallel application; parallel communication API; proactive process migration; rollback recovery; runtime library; single process migration; Checkpointing; Fault tolerance; Fault tolerant systems; Process control; Program processors; Proposals; Protocols; Checkpointing and Restart; Fault Tolerance; MPI; Process Migration;
Conference_Title :
Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2011 12th International Conference on
Conference_Location :
Gwangju
Print_ISBN :
978-1-4577-1807-6
DOI :
10.1109/PDCAT.2011.16