Title :
Characterizing the Impact of Rollback Avoidance at Extreme-Scale: A Modeling Approach
Author :
Levy, Scott ; Ferreira, Kurt B. ; Bridges, Patrick G.
Author_Institution :
Dept. of Comput. Sci., Univ. of New Mexico, Albuquerque, NM, USA
Abstract :
Resilience to failure is a key concern for next-generation high-performance computing systems. The dominant fault tolerance mechanism, coordinated checkpoint/restart, is projected to no longer be a viable option on these systems due to its predicted overheads. Rollback avoidance has the potential to prolong the viability of coordinated checkpoint/restart by allowing an application to make meaningful forward progress, perhaps with degraded performance, despite the occurrence or imminence of a failure. In this paper, we present two general analytic models for the performance of rollback avoidance techniques and validate these models against the performance of existing rollback avoidance techniques. We then use these models to evaluate the applicability of rollback avoidance for next-generation exascale systems. This includes analysis of exascale system design questions such as: (1) how effective must an application-specific rollback avoidance technique be to usefully augment checkpointing in an exascale system? (2) when is rollback avoidance on its own a viable alternative to coordinated checkpointing? and (3) how do rollback avoidance techniques and system characteristics interact to influence application performance?
Keywords :
checkpointing; next generation networks; parallel processing; application-specific rollback avoidance technique; augment checkpointing; coordinated checkpoint/restart; exascale system design questions; extreme-scale; failure resilience; fault tolerance mechanism; next-generation exascale systems; next-generation high-performance computing systems; rollback avoidance techniques; system characteristics; Analytical models; Checkpointing; Computational modeling; Equations; Mathematical model; Predictive models; Runtime
Conference_Title :
Parallel Processing (ICPP), 2014 43rd International Conference on
Conference_Location :
Minneapolis, MN
DOI :
10.1109/ICPP.2014.49