• DocumentCode
    2052323
  • Title
    Evaluating cooperative checkpointing for supercomputing systems
  • Author
    Oliner, Adam ; Sahoo, Ramendra
  • Author_Institution
    Dept. of Comput. Sci., Stanford Univ., Palo Alto, CA
  • fYear
    2006
  • fDate
    25-29 April 2006
  • Abstract
    Cooperative checkpointing, in which the system dynamically skips checkpoints requested by applications at runtime, can exploit system-level information to improve performance and reliability in the face of failures. We evaluate the applicability of cooperative checkpointing to large-scale systems through simulation studies considering real workloads, failure logs, and different network topologies. We consider two cooperative checkpointing algorithms: work-based cooperative checkpointing uses a heuristic based on the amount of unsaved work, and risk-based cooperative checkpointing leverages failure event prediction. Our results demonstrate that, compared to periodic checkpointing, risk-based checkpointing with event prediction accuracy as low as 10% is able to significantly improve system utilization and reduce average bounded slowdown by a factor of 9, without losing any additional work to failures. Similarly, work-based checkpointing conferred tremendous performance benefits in the face of large checkpoint overheads.
  • Keywords
    checkpointing; fault tolerant computing; parallel machines; cooperative checkpointing evaluation; failure event prediction; periodic checkpointing; risk-based checkpointing; risk-based cooperative checkpointing; supercomputing system; work-based cooperative checkpointing; Application software; Checkpointing; Computer network reliability; Computer science; Costs; Large-scale systems; Network topology; Performance analysis; Runtime; System performance;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International
  • Conference_Location
    Rhodes Island
  • Print_ISBN
    1-4244-0054-6
  • Type
    conf
  • DOI
    10.1109/IPDPS.2006.1639693
  • Filename
    1639693