• DocumentCode
    1995941
  • Title

    Revisiting the Double Checkpointing Algorithm

  • Author

    Dongarra, Jack ; Herault, Thomas ; Robert, Yves

  • Author_Institution
    Univ. of Tennessee, Knoxville, TN, USA
  • fYear
    2013
  • fDate
    20-24 May 2013
  • Firstpage
    706
  • Lastpage
    715
  • Abstract
    Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kalé with the non-blocking algorithm of Ni, Meneses and Kalé, in terms of both performance and risk. We also extend the model they proposed to assess the impact of the overhead associated with non-blocking communications. We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work at constant memory and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols, and compare them through comprehensive simulations.
  • Keywords
    parallel processing; blocking algorithm; distributed access; double checkpointing algorithm; fast checkpointing algorithms; nonblocking communications; parallel computing environments; peer-to-peer checkpointing algorithm; Algorithm design and analysis; Checkpointing; Computational modeling; Equations; Peer-to-peer computing; Protocols; Reliability; checkpoint; in-memory checkpoint; performance model; scheduling
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International
  • Conference_Location
    Cambridge, MA
  • Print_ISBN
    978-0-7695-4979-8
  • Type

    conf

  • DOI
    10.1109/IPDPSW.2013.11
  • Filename
    6650947