• DocumentCode
    9918
  • Title

    Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System

  • Author

    Mohror, Kathryn ; Moody, Adam ; Bronevetsky, Greg ; de Supinski, Bronis R.

  • Author_Institution
    Lawrence Livermore Nat. Lab., Livermore, CA, USA
  • Volume
    25
  • Issue
    9
  • fYear
    2014
  • fDate
    Sept. 2014
  • Firstpage
    2255
  • Lastpage
    2263
  • Abstract
    High-performance computing (HPC) systems are growing more powerful by utilizing more components. As the system mean time before failure correspondingly drops, applications must checkpoint frequently to make progress. However, at scale, the cost of checkpointing becomes prohibitive. A solution to this problem is multilevel checkpointing, which employs multiple types of checkpoints in a single run. Lightweight checkpoints can handle the most common failure modes, while more expensive checkpoints can handle severe failures. We designed a multilevel checkpointing library, the Scalable Checkpoint/Restart (SCR) library, that writes lightweight checkpoints to node-local storage in addition to the parallel file system. We present probabilistic Markov models of SCR's performance. We show that on future large-scale systems, SCR can lead to a gain in machine efficiency of up to 35 percent, and reduce the load on the parallel file system by a factor of two. Additionally, we predict that checkpoint scavenging, or only writing checkpoints to the parallel file system on application termination, can reduce the load on the parallel file system by 20× on today's systems and still maintain high application efficiency.
  • Keywords
    checkpointing; parallel processing; SCR library; application termination; checkpoint scavenging; high-performance computing systems; lightweight checkpoints; machine efficiency; multilevel checkpointing library; node-local storage; parallel file system; probabilistic Markov models; scalable checkpoint-restart library; scalable multilevel checkpointing system; Checkpointing; Computational modeling; Load modeling; Markov processes; Mathematical model; Predictive models; Thyristors; Fault tolerance; evaluation; measurement; modeling; simulation of multiple-processor systems;
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type

    jour

  • DOI
    10.1109/TPDS.2013.100
  • Filename
    6494566