DocumentCode :
244273
Title :
Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems
Author :
Tiwari, D. ; Gupta, Saurabh ; Vazhkudai, Sudharshan S.
Author_Institution :
Oak Ridge Nat. Lab., Oak Ridge, TN, USA
fYear :
2014
fDate :
23-26 June 2014
Firstpage :
25
Lastpage :
36
Abstract :
Continuing increase in the computational power of supercomputers has enabled large-scale scientific applications in the areas of astrophysics, fusion, climate and combustion to run larger and longer-running simulations, facilitating deeper scientific insights. However, these long-running simulations are often interrupted by multiple system failures. Therefore, these applications rely on "checkpointing" as a resilience mechanism to store application state to permanent storage and recover from failures. Unfortunately, checkpointing incurs excessive I/O overhead on supercomputers due to the large size of checkpoints, resulting in sub-optimal performance and resource utilization. In this paper, we devise novel mechanisms to show how checkpointing overhead can be mitigated significantly by exploiting the temporal characteristics of system failures. We provide new insights and detailed quantitative understanding of the checkpointing overheads and trade-offs on large-scale machines. Our prototype implementation shows the viability of our approach on extreme-scale machines.
Keywords :
checkpointing; parallel machines; application state storage; astrophysics; checkpoint size; checkpointing overhead mitigation; climate; combustion; computational power; excessive I/O overhead; extreme-scale machines; extreme-scale systems; failure recovery; failure temporal locality; fusion; large-scale machines; large-scale scientific applications; lazy checkpointing; long-running simulations; multiple system failures; permanent storage; resilience mechanism; resource utilization; supercomputers; system failure temporal characteristics; Analytical models; Bandwidth; Checkpointing; Computational modeling; Exponential distribution; Lead; Supercomputers; checkpointing; extreme-scale; locality; resilience; storage; supercomputing; system failures;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on
Conference_Location :
Atlanta, GA
Type :
conf
DOI :
10.1109/DSN.2014.101
Filename :
6903564