DocumentCode
720575
Title
Lessons Learned Implementing User-Level Failure Mitigation in MPICH
Author
Bland, Wesley; Lu, Huiwei; Seo, Sangmin; Balaji, Pavan
Author_Institution
Math. & Comput. Sci. Div., Argonne Nat. Lab., Argonne, IL, USA
fYear
2015
fDate
4-7 May 2015
Firstpage
1123
Lastpage
1126
Abstract
User-level failure mitigation (ULFM) is becoming the front-running solution for process fault tolerance in MPI. Although not yet adopted into the MPI standard, it is already used by applications and libraries, and the MPI Forum is considering it for future inclusion in MPI itself. In this paper, we introduce an implementation of ULFM in MPICH, a high-performance and widely portable implementation of the MPI standard. We demonstrate that, although this is still a reference implementation, the runtime cost of the newly introduced API calls is relatively low.
Keywords
application program interfaces; fault tolerant computing; message passing; system recovery; API calls; MPI forum; MPI standard; MPICH; ULFM; fault tolerance; runtime cost; user-level failure mitigation; Fault tolerance; Fault tolerant systems; Libraries; Proposals; Resilience; Runtime; Standards; fault tolerance; mpi; mpich; ulfm;
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
Conference_Location
Shenzhen
Type
conf
DOI
10.1109/CCGrid.2015.51
Filename
7152602