• DocumentCode
    2933482
  • Title

    A scalable double in-memory checkpoint and restart scheme towards exascale

  • Author

    Zheng, Gengbin ; Ni, Xiang ; Kalé, Laxmikant V.

  • Author_Institution
    Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
  • fYear
    2012
  • fDate
    25-28 June 2012
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    As the size of supercomputers increases, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. It is important to provide resilience for long running applications. Checkpoint-based fault tolerance methods are effective approaches at dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a failure occurs, the application is restarted from a recent checkpoint. In previous work, we have demonstrated an efficient double in-memory checkpoint and restart fault tolerance scheme, which leverages Charm++'s parallel objects for checkpointing. In this paper, we further optimize the scheme by eliminating several bottlenecks caused by serialized communication. We extend the in-memory checkpointing scheme to work on MPI communication layer, and demonstrate the performance on very large scale supercomputers. For example, when running a one million atom molecular dynamics simulation on up to 64K cores of a BlueGene/P machine, the checkpoint time was in milliseconds. The restart time was measured to be less than 0.15 seconds on 64K cores.
  • Keywords
    application program interfaces; checkpointing; fault tolerant computing; mainframes; message passing; parallel processing; MPI communication layer; checkpoint-based fault tolerance methods; double in-memory checkpointing scheme; exascale; parallel application; restart scheme; very large scale supercomputers; Checkpointing; Computer crashes; Fault tolerance; Fault tolerant systems; Optimization; Program processors; Protocols;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on
  • Conference_Location
    Boston, MA
  • Print_ISBN
    978-1-4673-2264-5
  • Electronic_ISBN
    978-1-4673-2265-2
  • Type

    conf

  • DOI
    10.1109/DSNW.2012.6264677
  • Filename
    6264677