DocumentCode
3596079
Title
SPBC: Leveraging the characteristics of MPI HPC applications for scalable checkpointing
Author
Ropars, Thomas ; Martsinkevich, Tatiana V. ; Guermouche, Abdou ; Schiper, Andre ; Cappello, Franck
Author_Institution
Ecole Polytech. Fed. de Lausanne (EPFL), Lausanne, Switzerland
fYear
2013
Firstpage
1
Lastpage
12
Abstract
The high failure rate expected for future supercomputers requires the design of new fault-tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: we identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation, called the always-happens-before relation, between events of such applications. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol, called SPBC, combines coordinated checkpointing and message logging in a hierarchical way. It is the first protocol that provides failure containment without reliably logging any information apart from process checkpoints, and it does so without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate good performance of our protocol during both failure-free execution and recovery.
Keywords
application program interfaces; checkpointing; failure analysis; message passing; parallel processing; protocols; software fault tolerance; software reliability; HPC workloads; MPI HPC; SPBC protocol; always-happens-before relation; channel-determinism; checkpointing protocols; failure containment; fault tolerant solutions; hierarchical way coordinated checkpointing; high failure rate; message logging; message-passing; partial order relation; scalable checkpointing; supercomputers; Checkpointing; Fault tolerance; Fault tolerant systems; Libraries; Payloads; Protocols; Algorithms; Reliability;
fLanguage
English
Publisher
ieee
Conference_Titel
High Performance Computing, Networking, Storage and Analysis (SC), 2013 International Conference for
Print_ISBN
978-1-4503-2378-9
Type
conf
DOI
10.1145/2503210.2503271
Filename
6877441