DocumentCode :
2009478
Title :
NR-MPI: A Non-stop and Fault Resilient MPI
Author :
Guang Suo ; Yutong Lu ; Xiangke Liao ; Min Xie ; Hongjia Cao
Author_Institution :
State Key Lab. of High Performance Comput., Nat. Univ. of Defense Technol., Changsha, China
fYear :
2013
fDate :
15-18 Dec. 2013
Firstpage :
190
Lastpage :
199
Abstract :
Fault resilience has become a major issue for HPC systems, particularly in view of future E-scale systems, which will consist of millions of CPU cores and other components. Fault-tolerant MPI was proposed to support software-level fault tolerance approaches. However, widely used MPI implementations, such as MPICH and MVAPICH2, provide limited support for fault tolerance. This paper proposes NR-MPI, a Non-stop and Fault Resilient MPI. NR-MPI implements the semantics of FT-MPI based on MPICH. Specifically, this paper focuses on failure detection in the MPI library, online recovery of communicators after multiple failures, and a friendly extension of the programming interface for NR-MPI. Furthermore, to support failure recovery of applications, NR-MPI implements data backup and restore interfaces based on double in-memory checkpoint/restart. We conduct experiments with the NPB benchmarks on the TH-1A supercomputer. Experimental results show that NR-MPI-based fault-tolerant programs can recover from failures online without restarting, and that the overhead is small even for applications running on tens of thousands of cores.
Keywords :
application program interfaces; fault tolerance; message passing; parallel processing; E-scale systems; HPC systems; NR-MPI; fault resilient MPI; nonstop MPI; software level fault tolerance; Context; Fault tolerance; Fault tolerant systems; Libraries; Programming; Resource management; Semantics; Application-level Checkpoint/Restart; fault tolerant MPI; message passing interface
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Systems (ICPADS), 2013 International Conference on
Conference_Location :
Seoul
ISSN :
1521-9097
Type :
conf
DOI :
10.1109/ICPADS.2013.37
Filename :
6808174