DocumentCode :
1122915
Title :
Low-latency, concurrent checkpointing for parallel programs
Author :
Li, Kai ; Naughton, Jeffrey F. ; Plank, James S.
Author_Institution :
Dept. of Comput. Sci., Princeton Univ., NJ, USA
Volume :
5
Issue :
8
fYear :
1994
fDate :
8/1/1994
Firstpage :
874
Lastpage :
879
Abstract :
Presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.
Keywords :
fault tolerant computing; parallel programming; program diagnostics; software reliability; system recovery; backward error recovery; copy-on-write; efficiency; fault tolerance; interruption time; low latency concurrent checkpointing; metrics; overall checkpointing time; overhead; overlapping operations; parallel programs; program restarting; shared-memory multiprocessors; Benchmark testing; Central Processing Unit; Checkpointing; Computer science; Concurrent computing; Delay; Fault tolerance; Fault tolerant systems; Registers
fLanguage :
English
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
Publisher :
IEEE
ISSN :
1045-9219
Type :
jour
DOI :
10.1109/71.298215
Filename :
298215