DocumentCode
2535546
Title
Checkpointing vs. Migration for Post-Petascale Supercomputers
Author
Cappello, Franck ; Casanova, H. ; Robert, Yves
Author_Institution
INRIA, Illinois Joint Lab. for Petascale Comput., Urbana-Champaign, IL, USA
fYear
2010
fDate
13-16 Sept. 2010
Firstpage
168
Lastpage
177
Abstract
An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance, by which the occurrence of a fault is predicted and a preventive measure is taken. We develop analytical performance models for two types of preventive measures: preventive checkpointing and preventive migration. We also develop an analytical model of the performance of a standard periodic checkpoint fault-tolerant approach. We instantiate these models for platform scenarios representative of current and future technology trends. We find that preventive migration is the better approach in the short term by orders of magnitude. However, in the longer term, both approaches have comparable merit with a marginal advantage for preventive checkpointing. We also find that standard non-prediction-based fault tolerance achieves poor scaling when compared to prediction-based failure avoidance, thereby demonstrating the importance of failure prediction capabilities. Finally, our results show that achieving good utilization in truly large-scale machines (e.g., 2^20 nodes) for parallel workloads will require more than the failure avoidance techniques evaluated in this work.
Keywords
checkpointing; fault tolerant computing; parallel machines; analytical performance models; large-scale clusters; large-scale machines; post-petascale supercomputer migration; prediction-based failure avoidance techniques; preventive checkpointing; preventive migration; standard nonprediction-based fault tolerance; standard periodic checkpoint fault-tolerant approach; Checkpointing; Fault tolerance; Fault tolerant systems; Mathematical model; Random access memory; Software; Throughput; checkpointing; failure prediction; migration; parallel jobs;
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel Processing (ICPP), 2010 39th International Conference on
Conference_Location
San Diego, CA
ISSN
0190-3918
Print_ISBN
978-1-4244-7913-9
Electronic_ISBN
0190-3918
Type
conf
DOI
10.1109/ICPP.2010.26
Filename
5599161