• DocumentCode
    1147656
  • Title
    Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery
  • Author
    Elnozahy, Elmootazbellah N. ; Plank, James S.

  • Author_Institution
    IBM Res., Austin, TX, USA
  • Volume
    1
  • Issue
    2
  • fYear
    2004
  • Firstpage
    97
  • Lastpage
    108
  • Abstract
    Over the past two decades, rollback-recovery via checkpoint-restart has been used with reasonable success for long-running applications, such as scientific workloads that take from a few hours to a few months to complete. Currently, several commercial systems and publicly available libraries exist to support various flavors of checkpointing. Programmers typically use these systems if they are satisfactory, or otherwise embed checkpointing support themselves within the application. In this paper, we project the performance and functionality of checkpointing algorithms and systems as we know them today into the future. We start by surveying the current technology roadmap, and particularly how Peta-Flop capable systems may be plausibly constructed in the next few years. We consider how rollback-recovery as practiced today will fare when systems may have to be constructed out of thousands of nodes. Our projections predict that, unlike in current practice, rollback-recovery may play a more prominent role in how systems are configured to reach the desired performance level. System planners may have to devote additional resources to enable rollback-recovery, and the current practice of using "cheap commodity" systems to form large-scale clusters may face serious obstacles. We suggest new avenues for research to react to these trends.
  • Keywords
    distributed programming; fault tolerant computing; multiprocessing systems; software reliability; system recovery; cheap commodity systems; checkpoint restart; commercial systems; distributed applications; distributed systems; fault tolerance; large-scale clusters; multiple-processor systems; petaflop capable systems; petascale system checkpointing; public libraries; rollback recovery; scientific workloads; system configuration; system reliability; Application software; Availability; Checkpointing; Computer Society; Computer applications; Fault tolerant systems; Large-scale systems; Libraries; Programming profession; System performance; Index Terms: Distributed systems; availability; distributed applications; evaluation; fault tolerance; measurement; modeling; modeling techniques; performance of systems; reliability; serviceability; simulation of multiple-processor systems.
  • fLanguage
    English
  • Journal_Title
    Dependable and Secure Computing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5971
  • Type
    jour
  • DOI
    10.1109/TDSC.2004.15
  • Filename
    1350776