DocumentCode
154163
Title
Characterizing the Impact of Rollback Avoidance at Extreme-Scale: A Modeling Approach
Author
Levy, Scott ; Ferreira, Kurt B. ; Bridges, Patrick G.
Author_Institution
Dept. of Comput. Sci., Univ. of New Mexico, Albuquerque, NM, USA
fYear
2014
fDate
9-12 Sept. 2014
Firstpage
401
Lastpage
410
Abstract
Resilience to failure is a key concern for next-generation high-performance computing systems. The dominant fault tolerance mechanism, coordinated checkpoint/restart, is projected to no longer be a viable option on these systems due to its predicted overheads. Rollback avoidance has the potential to prolong the viability of coordinated checkpoint/restart by allowing an application to make meaningful forward progress, perhaps with degraded performance, despite the occurrence or imminence of a failure. In this paper, we present two general analytic models for the performance of rollback avoidance techniques and validate these models against the performance of existing rollback avoidance techniques. We then use these models to evaluate the applicability of rollback avoidance for next-generation exascale systems. This includes analysis of exascale system design questions such as: (1) how effective must an application-specific rollback avoidance technique be to usefully augment checkpointing in an exascale system? (2) when is rollback avoidance on its own a viable alternative to coordinated checkpointing? and (3) how do rollback avoidance techniques and system characteristics interact to influence application performance?
Keywords
checkpointing; next generation networks; parallel processing; application-specific rollback avoidance technique; augment checkpointing; coordinated checkpoint/restart; exascale system design questions; extreme-scale; failure resilience; fault tolerance mechanism; next-generation exascale systems; next-generation high-performance computing systems; rollback avoidance techniques; system characteristics; Analytical models; Checkpointing; Computational modeling; Equations; Mathematical model; Predictive models; Runtime
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel Processing (ICPP), 2014 43rd International Conference on
Conference_Location
Minneapolis, MN
ISSN
0190-3918
Type
conf
DOI
10.1109/ICPP.2014.49
Filename
6957249