DocumentCode
1995941
Title
Revisiting the Double Checkpointing Algorithm
Author
Dongarra, Jack ; Herault, Thomas ; Robert, Yannick
Author_Institution
Univ. of Tennessee, Knoxville, TN, USA
fYear
2013
fDate
20-24 May 2013
Firstpage
706
Lastpage
715
Abstract
Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kalé with the non-blocking algorithm of Ni, Meneses and Kalé, in terms of both performance and risk. We also extend the model they proposed in order to assess the impact of the overhead associated with non-blocking communications. We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work at constant memory and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols, and compare them through comprehensive simulations.
Keywords
parallel processing; blocking algorithm; distributed access; double checkpointing algorithm; fast checkpointing algorithms; nonblocking communications; parallel computing environments; peer-to-peer checkpointing algorithm; Algorithm design and analysis; Checkpointing; Computational modeling; Equations; Peer-to-peer computing; Protocols; Reliability; checkpoint; in-memory checkpoint; performance model; scheduling
fLanguage
English
Publisher
ieee
Conference_Titel
Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International
Conference_Location
Cambridge, MA
Print_ISBN
978-0-7695-4979-8
Type
conf
DOI
10.1109/IPDPSW.2013.11
Filename
6650947