Title :
Recent advances in checkpoint/recovery systems
Author :
Bronevetsky, Greg ; Fernandes, Rohit ; Marques, Daniel ; Pingali, Keshav ; Stodghill, Paul
Author_Institution :
Dept. of Comput. Sci., Cornell Univ., Ithaca, NY
Abstract :
Checkpoint and recovery (CPR) systems have many uses in high-performance computing. Because of this, many developers have implemented checkpointing by hand in their applications. One of the uses of checkpointing is to help mitigate the effects of interruptions in computational service (both planned and unplanned). In fact, some supercomputing centers expect their users to use checkpointing as a matter of policy. And yet, few centers provide fully automatic checkpointing systems for their high-end production machines. This paper is a status report on our work on the family of C3 systems for (almost) fully automatic checkpointing for scientific applications. To date, we have shown that our techniques can be used for checkpointing sequential, MPI, and OpenMP applications written in C, Fortran, and several other languages. A novel aspect of our work is that we have not built a single checkpointing system; rather, we have developed a methodology and a set of techniques that have enabled us to develop a number of systems, each meeting different design goals and efficiency requirements.
Keywords :
checkpointing; message passing; parallel machines; OpenMP; checkpointing system; computational service; high-end production machine; high-performance computing; message passing interface; recovery system; sequential application; supercomputing center; Application software; Checkpointing; Computer crashes; Computer science; Debugging; Fault tolerance; Hardware; Production systems; Resource management; Visualization
Conference_Title :
Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International
Conference_Location :
Rhodes Island
Print_ISBN :
1-4244-0054-6
DOI :
10.1109/IPDPS.2006.1639575