DocumentCode
2009478
Title
NR-MPI: A Non-stop and Fault Resilient MPI
Author
Guang Suo ; Yutong Lu ; Xiangke Liao ; Min Xie ; Hongjia Cao
Author_Institution
State Key Lab. of High Performance Comput., Nat. Univ. of Defense Technol., Changsha, China
fYear
2013
fDate
15-18 Dec. 2013
Firstpage
190
Lastpage
199
Abstract
Fault resilience has become a major issue for HPC systems, particularly in view of future E-scale systems, which will consist of millions of CPU cores and other components. Fault-tolerant MPI was proposed to support software-level fault tolerance approaches. However, widely used MPI implementations, such as MPICH and MVAPICH2, provide limited support for fault tolerance. This paper proposes NR-MPI, a Non-stop and Fault Resilient MPI. NR-MPI implements the semantics of FT-MPI based on MPICH. Specifically, this paper focuses on failure detection in the MPI library, online recovery of communicators after multiple failures, and a friendly extension of the programming interface for NR-MPI. Furthermore, to support failure recovery of applications, NR-MPI implements data backup and restore interfaces based on double in-memory checkpoint/restart. We conduct experiments with the NPB benchmarks on the TH-1A supercomputer. Experimental results show that NR-MPI-based fault-tolerant programs can recover from failures online without restarting, and that the overhead is small even for applications running on tens of thousands of cores.
Keywords
application program interfaces; fault tolerance; message passing; parallel processing; E-scale systems; HPC systems; NR-MPI; fault resilient MPI; nonstop MPI; software level fault tolerance; Context; Fault tolerance; Fault tolerant systems; Libraries; Programming; Resource management; Semantics; Application-level Checkpoint/Restart; fault tolerant MPI; message passing interface
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel and Distributed Systems (ICPADS), 2013 International Conference on
Conference_Location
Seoul
ISSN
1521-9097
Type
conf
DOI
10.1109/ICPADS.2013.37
Filename
6808174