DocumentCode :
3026945
Title :
Performance implications of periodic checkpointing on large-scale cluster systems
Author :
Oliner, A.J. ; Sahoo, R.K. ; Moreira, J.E. ; Gupta, M.
Author_Institution :
Dept. of Electr. Eng. & Comput. Sci., MIT, Cambridge, MA, USA
fYear :
2005
fDate :
4-8 April 2005
Abstract :
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a common technique for mitigating the amount of work lost due to job failures, but its effectiveness under realistic circumstances has not been studied. In this paper, we analyze the system-level performance of periodic application checkpointing using parameters similar to those projected for BlueGene/L systems. Our results reflect simulations on a toroidal interconnect architecture, using a real job log from a machine similar to BlueGene/L, and with a real failure distribution from a large-scale cluster. Our simulation studies investigate the impact of parameters such as checkpoint overhead and checkpoint interval on a number of performance metrics, including bounded slowdown, system utilization, and total work lost. The results suggest that periodic checkpointing may not be an effective way to improve the average bounded slowdown or average system utilization metrics, though it reduces the amount of work lost due to failures. We show that overzealous checkpointing with high overhead can amplify the effects of failures. The study also suggests that new metrics and checkpointing techniques may be required to effectively handle job failures on large-scale machines like BlueGene/L.
Keywords :
checkpointing; failure analysis; fault tolerant computing; parallel machines; software performance evaluation; BlueGene/L; large-scale cluster systems; periodic application checkpointing; system-level performance; toroidal interconnect architecture; Application software; Checkpointing; Frequency; Hardware; Large-scale systems; Measurement; Parallel machines; Performance analysis; Software performance; System performance;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International
Print_ISBN :
0-7695-2312-9
Type :
conf
DOI :
10.1109/IPDPS.2005.337
Filename :
1420276