• DocumentCode
    2009478
  • Title

    NR-MPI: A Non-stop and Fault Resilient MPI

  • Author

    Guang Suo ; Yutong Lu ; Xiangke Liao ; Min Xie ; Hongjia Cao

  • Author_Institution
    State Key Lab. of High Performance Comput., Nat. Univ. of Defense Technol., Changsha, China
  • fYear
    2013
  • fDate
    15-18 Dec. 2013
  • Firstpage
    190
  • Lastpage
    199
  • Abstract
    Fault resilience has become a major issue for HPC systems, particularly in view of future E-scale systems, which will consist of millions of CPU cores and other components. Fault-tolerant MPI was proposed to support software-level fault tolerance approaches. However, widely used MPI implementations, such as MPICH and MVAPICH2, provide limited support for fault tolerance. This paper proposes NR-MPI, a Non-stop and Fault Resilient MPI. NR-MPI implements the semantics of FT-MPI based on MPICH. Specifically, this paper focuses on failure detection in the MPI library, online recovery of communicators after multiple failures, and a friendly extended programming interface for NR-MPI. Furthermore, to support failure recovery of applications, NR-MPI implements data backup and restore interfaces based on double in-memory checkpoint/restart. We conduct experiments with the NPB benchmarks on the TH-1A supercomputer. Experimental results show that NR-MPI-based fault-tolerant programs can recover from failures online without restarting, and that the overhead is small even for applications with tens of thousands of cores.
  • Keywords
    application program interfaces; fault tolerance; message passing; parallel processing; E-scale systems; HPC systems; NR-MPI; fault resilient MPI; nonstop MPI; software level fault tolerance; Context; Fault tolerance; Fault tolerant systems; Libraries; Programming; Resource management; Semantics; Application-level Checkpoint/Restart; NR-MPI; fault tolerant MPI; message passing interface;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Systems (ICPADS), 2013 International Conference on
  • Conference_Location
    Seoul
  • ISSN
    1521-9097
  • Type

    conf

  • DOI
    10.1109/ICPADS.2013.37
  • Filename
    6808174