• DocumentCode
    720575
  • Title
    Lessons Learned Implementing User-Level Failure Mitigation in MPICH

  • Author
    Bland, Wesley; Lu, Huiwei; Seo, Sangmin; Balaji, Pavan

  • Author_Institution
    Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
  • fYear
    2015
  • fDate
    4-7 May 2015
  • Firstpage
    1123
  • Lastpage
    1126
  • Abstract
    User-level failure mitigation (ULFM) is becoming the front-running solution for process fault tolerance in MPI. While not yet adopted into the MPI standard, it is being used by applications and libraries and is being considered by the MPI Forum for future inclusion into MPI itself. In this paper, we introduce an implementation of ULFM in MPICH, a high-performance and widely portable implementation of the MPI standard. We demonstrate that, although this is still a reference implementation, the runtime cost of the new API calls it introduces is relatively low.
  • Keywords
    application program interfaces; fault tolerant computing; message passing; system recovery; API calls; MPI Forum; MPI standard; MPICH; ULFM; fault tolerance; runtime cost; user-level failure mitigation; Fault tolerant systems; Libraries; Proposals; Resilience; Runtime; Standards; MPI
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
  • Conference_Location
    Shenzhen
  • Type
    conf
  • DOI
    10.1109/CCGrid.2015.51
  • Filename
    7152602