• DocumentCode
    2535546
  • Title
    Checkpointing vs. Migration for Post-Petascale Supercomputers
  • Author
    Cappello, Franck ; Casanova, H. ; Robert, Yves
  • Author_Institution
    INRIA, Illinois Joint Lab. for Petascale Comput., Urbana-Champaign, IL, USA
  • fYear
    2010
  • fDate
    13-16 Sept. 2010
  • Firstpage
    168
  • Lastpage
    177
  • Abstract
    An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance, by which the occurrence of a fault is predicted and a preventive measure is taken. We develop analytical performance models for two types of preventive measures: preventive checkpointing and preventive migration. We also develop an analytical model of the performance of a standard periodic checkpoint fault-tolerant approach. We instantiate these models for platform scenarios representative of current and future technology trends. We find that preventive migration is the better approach in the short term by orders of magnitude. However, in the longer term, both approaches have comparable merit with a marginal advantage for preventive checkpointing. We also find that standard non-prediction-based fault tolerance achieves poor scaling when compared to prediction-based failure avoidance, thereby demonstrating the importance of failure prediction capabilities. Finally, our results show that achieving good utilization in truly large-scale machines (e.g., 2^20 nodes) for parallel workloads will require more than the failure avoidance techniques evaluated in this work.
  • Keywords
    checkpointing; fault tolerant computing; parallel machines; analytical performance models; large-scale clusters; large-scale machines; post-petascale supercomputer migration; prediction-based failure avoidance techniques; preventive checkpointing; preventive migration; standard nonprediction-based fault tolerance; standard periodic checkpoint fault-tolerant approach; Checkpointing; Fault tolerance; Fault tolerant systems; Mathematical model; Random access memory; Software; Throughput; checkpointing; failure prediction; migration; parallel jobs;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel Processing (ICPP), 2010 39th International Conference on
  • Conference_Location
    San Diego, CA
  • ISSN
    0190-3918
  • Print_ISBN
    978-1-4244-7913-9
  • Electronic_ISBN
    0190-3918
  • Type
    conf
  • DOI
    10.1109/ICPP.2010.26
  • Filename
    5599161