DocumentCode :
3023387
Title :
Optimizing checkpoint sizes in the C3 system
Author :
Marques, Daniel ; Bronevetsky, Greg ; Fernandes, Rohit ; Pingali, Keshav ; Stodghill, Paul
Author_Institution :
Dept. of Comput. Sci., Cornell Univ., Ithaca, NY, USA
fYear :
2005
fDate :
4-8 April 2005
Abstract :
The running times of many computational science applications are much longer than the mean-time-between-failures (MTBF) of current high-performance computing platforms. To run to completion, such applications must tolerate hardware failures. Checkpoint-and-restart (CPR) is the most commonly used scheme for accomplishing this: the state of the computation is saved periodically on stable storage, and when a hardware failure is detected, the computation is restarted from the most recently saved state. Most automatic CPR schemes in the literature can be classified as system-level checkpointing schemes because they take core-dump style snapshots of the computational state when all the processes are blocked at global barriers in the program. Unfortunately, a system that implements this style of checkpointing is tied to a particular platform and cannot optimize the checkpointing process using application-specific knowledge. We are exploring an alternative called automatic application-level checkpointing. In our approach, programs are transformed by a pre-processor so that they become self-checkpointing and self-restartable on any platform. In this paper, we evaluate a mechanism that utilizes application knowledge to minimize the amount of information saved in a checkpoint.
Keywords :
checkpointing; fault tolerance; parallel processing; automatic application-level checkpointing; checkpoint optimization; checkpoint-and-restart; computational science application; high-performance computing platform; mean-time-between-failures; system-level checkpointing scheme; Application software; Automatic logic units; Checkpointing; Computer applications; Computer science; Concurrent computing; Hardware; Programming profession; Protocols; Software systems;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International
Print_ISBN :
0-7695-2312-9
Type :
conf
DOI :
10.1109/IPDPS.2005.316
Filename :
1420141