• DocumentCode
    1907958
  • Title

    Modeling the Impact of Checkpoints on Next-Generation Systems

  • Author

    Oldfield, Ron A. ; Arunagiri, Sarala ; Teller, Patricia J. ; Seelam, Seetharami ; Varela, Maria Ruiz ; Riesen, Rolf ; Roth, Philip C.

  • Author_Institution
    Sandia National Laboratories, Livermore
  • fYear
    2007
  • fDate
    24-27 Sept. 2007
  • Firstpage
    30
  • Lastpage
    46
  • Abstract
    The next generation of capability-class, massively parallel processing (MPP) systems is expected to have hundreds of thousands of processors. For application-driven, periodic checkpoint operations, the state of the art does not provide a solution that scales to next-generation systems. We demonstrate this by using mathematical modeling to compute a lower bound on the impact of these approaches on the performance of applications executed on three massive-scale, in-production DOE systems and a theoretical petaflop system. We also adapt the model to investigate a proposed optimization that makes use of "lightweight" storage architectures and overlay networks to overcome the storage system bottleneck. Our results indicate that (1) as we approach the scale of next-generation systems, traditional checkpoint/restart approaches will increasingly impact application performance, accounting for over 50% of total application execution time; (2) although our alternative approach improves performance, it has limitations of its own; and (3) there is a critical need for new approaches to fault tolerance that allow continuous computing with minimal impact on application scalability.
  • Keywords
    checkpointing; memory architecture; parallel processing; application-driven periodic checkpoint operations; capability-class MPP systems; lightweight storage architectures; massive-scale in-production DOE systems; massively parallel processing systems; mathematical modeling; next-generation systems; overlay networks; petaflop system; Bandwidth; Computer networks; Contracts; Delay; Fault tolerance; Fault tolerant systems; Laboratories; Large-scale systems; Parallel processing; US Department of Energy
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Mass Storage Systems and Technologies, 2007. MSST 2007. 24th IEEE Conference on
  • Conference_Location
    San Diego, CA
  • Print_ISBN
    978-0-7695-3025-3
  • Type

    conf

  • DOI
    10.1109/MSST.2007.4367962
  • Filename
    4367962