• DocumentCode
    1999015
  • Title

    An Evaluation of Different I/O Techniques for Checkpoint/Restart

  • Author

    Shahzad, Faisal ; Wittmann, M. ; Zeiser, Thomas ; Hager, Georg ; Wellein, Gerhard

  • Author_Institution
    Erlangen Regional Comput. Center, Univ. of Erlangen-Nuremberg, Erlangen, Germany
  • fYear
    2013
  • fDate
    20-24 May 2013
  • Firstpage
    1708
  • Lastpage
    1716
  • Abstract
    Today's High Performance Computing (HPC) clusters consist of hundreds of thousands of CPUs, memory units, complex networks, and other components. Such an extreme level of hardware parallelism reduces the mean time to failure (MTTF) of the overall cluster. The future of HPC urgently demands environments that allow programs to run successfully even in the presence of failures. Checkpoint/Restart (C/R) is one of the most common techniques for providing fault tolerance. C/R is relatively easy to implement, but it typically adds significant overhead to the runtime of the application. In this paper, a checkpointing technique is presented that significantly reduces the checkpoint overhead and is highly scalable. This is achieved by overlapping the I/O for writing the checkpoint with the computation of the application. For this asynchronous checkpointing technique, a theoretical model is developed to estimate the checkpoint overhead. An implementation of this technique is then benchmarked and compared with other checkpointing strategies. We show that our approach has marginal overhead, as opposed to standard synchronous checkpointing, for typical application scenarios. A comparison with the node-level checkpointing technique using the Scalable Checkpoint/Restart (SCR) library is also presented.
  • Keywords
    application program interfaces; benchmark testing; checkpointing; parallel processing; software fault tolerance; HPC clusters; I/O technique evaluation; MTTF; asynchronous checkpointing technique; checkpoint overhead; checkpointing strategies; checkpointing technique; complex networks; fault tolerance; hardware parallelism; high performance computing clusters; mean time to failure; memory units; node-level checkpointing technique; scalable checkpoint-restart library; synchronous checkpointing; Bandwidth; Benchmark testing; Checkpointing; Fault tolerance; Fault tolerant systems; Instruction sets; Libraries; MPI; asynchronous checkpointing; checkpoint/restart; fault tolerance; multi-stage checkpointing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International
  • Conference_Location
    Cambridge, MA
  • Print_ISBN
    978-0-7695-4979-8
  • Type

    conf

  • DOI
    10.1109/IPDPSW.2013.145
  • Filename
    6651069