DocumentCode :
1995941
Title :
Revisiting the Double Checkpointing Algorithm
Author :
Dongarra, Jack ; Herault, Thomas ; Robert, Yves
Author_Institution :
Univ. of Tennessee, Knoxville, TN, USA
fYear :
2013
fDate :
20-24 May 2013
Firstpage :
706
Lastpage :
715
Abstract :
Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kalé with the non-blocking algorithm of Ni, Meneses and Kalé, in terms of both performance and risk. We also extend the model they proposed, in order to assess the impact of the overhead associated with non-blocking communications. We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work at constant memory and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols, and compare them through comprehensive simulations.
Keywords :
parallel processing; blocking algorithm; distributed access; double checkpointing algorithm; fast checkpointing algorithms; nonblocking communications; parallel computing environments; peer-to-peer checkpointing algorithm; Algorithm design and analysis; Checkpointing; Computational modeling; Equations; Peer-to-peer computing; Protocols; Reliability; checkpoint; in-memory checkpoint; performance model; scheduling;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International
Conference_Location :
Cambridge, MA
Print_ISBN :
978-0-7695-4979-8
Type :
conf
DOI :
10.1109/IPDPSW.2013.11
Filename :
6650947