DocumentCode :
3697010
Title :
Marriage Between Coordinated and Uncoordinated Checkpointing for the Exascale Era
Author :
Omer Subasi;Ferad Zyulkyarov;Osman Unsal;Jesus Labarta
Author_Institution :
Barcelona Supercomput. Center, Univ. Politec. de Catalunya, Barcelona, Spain
fYear :
2015
Firstpage :
470
Lastpage :
478
Abstract :
State-of-the-art checkpointing techniques are projected to be prohibitively expensive in the Exascale era. These techniques are most often holistic in nature, which prevents them from leveraging programming-model- and paradigm-specific advantages that would make them viable at Exascale. In this work, we present a unified non-hierarchical model that combines uncoordinated checkpointing with coordinated system-wide checkpointing to capitalize on programming-model-specific advantages. In our analytical assessment, we develop closed-form formulas for the performance improvement and the optimal checkpoint interval of the unified model. As an instantiation of our model, we propose to unify task-level checkpointing with a system-wide checkpointing scheme for task-parallel HPC applications. This instantiation has three distinct advantages: first, it reduces performance overheads by decreasing the frequency of checkpoints in the unified system; second, it features fast failure recovery by using in-memory task-local checkpoints instead of on-disk global checkpoints; and third, it does not compromise the high failure coverage typical of system-wide checkpointing.
Keywords :
"Checkpointing","Parallel processing","Mathematical model","Performance gain","Fault tolerance","Fault tolerant systems"
Publisher :
IEEE
Conference_Title :
High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on
Type :
conf
DOI :
10.1109/HPCC-CSS-ICESS.2015.150
Filename :
7336204