• DocumentCode
    2933248
  • Title
    Asynchronous checkpoint migration with MRNet in the Scalable Checkpoint / Restart Library
  • Author
    Mohror, Kathryn ; Moody, Adam ; De Supinski, Bronis R.
  • fYear
    2012
  • fDate
    25-28 June 2012
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    Applications running on today's supercomputers tolerate failures by periodically saving their state in checkpoint files on stable storage, such as a parallel file system. Although this approach is simple, the overhead of writing the checkpoints can be prohibitive, especially for large-scale jobs. In this paper, we present initial results of an enhancement to our Scalable Checkpoint / Restart Library (SCR). We employ MRNet, a tree-based overlay network library, to transfer checkpoints from the compute nodes to the parallel file system asynchronously. This enhancement increases application efficiency by removing the need for an application to block while checkpoints are transferred to the parallel file system. We show that the integration of SCR with MRNet can reduce the time spent in I/O operations by as much as 15×. However, our experiments exposed new scalability issues with our initial implementation. We discuss the sources of the scalability problems and our plans to address them.
  • Keywords
    checkpointing; fault tolerant computing; mainframes; network operating systems; overlay networks; parallel databases; trees (mathematics); MRNet; SCR library; application efficiency; asynchronous checkpoint migration; checkpoint transfer; failure tolerance; file checkpointing; parallel file system; scalability issues; scalable checkpoint-restart library; stable storage; supercomputers; tree-based overlay network library; Checkpointing; High performance computing; Libraries; Redundancy; Scalability; Thyristors; Writing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on
  • Conference_Location
    Boston, MA
  • Print_ISBN
    978-1-4673-2264-5
  • Electronic_ISBN
    978-1-4673-2265-2
  • Type
    conf
  • DOI
    10.1109/DSNW.2012.6264668
  • Filename
    6264668