DocumentCode
3424051
Title
User-triggered checkpointing: system-independent and scalable application recovery
Author
Deconinck, Geert ; Lauwereins, Rudy
Author_Institution
Electrotech. Dept., Katholieke Univ., Leuven, Heverlee, Belgium
fYear
1997
fDate
1-3 Jul 1997
Firstpage
418
Lastpage
423
Abstract
User-triggered checkpointing and rollback is proposed as a system-independent and flexible way to integrate backward error recovery into long-running, computation-intensive message-passing applications on large parallel multicomputers. It employs library calls to coordinate the checkpointing, allowing a non-blocking and scalable approach that requires no protocol to save a consistent state because the coordination among the processes is implicit. The explicit indication of the checkpoint contents (i.e., the items whose state must be saved) allows one to significantly reduce the amount of checkpoint data and the overhead. In contrast to other checkpointing approaches, the implementation does not rely on system-dependent features (such as saving register values or communication status) to save the state. Instead, re-executing the first part of the application brings the system-specific items into a state consistent with the rest of the checkpoint contents, which are restored from the saved checkpoint data.
Keywords
fault tolerant computing; message passing; parallel machines; system recovery; backward error recovery; checkpoint contents; checkpoint data; message-passing applications; nonblocking approach; overhead reduction; parallel multicomputers; rollback; scalable application recovery; scalable approach; system-independent recovery; user-triggered checkpointing; Application software; Checkpointing; Computational modeling; Computer applications; Concurrent computing; High performance computing; Libraries; Power system modeling; Power system restoration; Protocols
fLanguage
English
Publisher
IEEE
Conference_Titel
Computers and Communications, 1997. Proceedings., Second IEEE Symposium on
Conference_Location
Alexandria
Print_ISBN
0-8186-7852-6
Type
conf
DOI
10.1109/ISCC.1997.616035
Filename
616035