DocumentCode :
745420
Title :
Checkpointing and Rollback-Recovery for Distributed Systems
Author :
Koo, Richard ; Toueg, Sam
Author_Institution :
Department of Computer Science, Cornell University
Issue :
1
fYear :
1987
Firstpage :
23
Lastpage :
31
Abstract :
We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consistent state. In contrast to previous algorithms, they tolerate failures that occur during their executions. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process rolls back and restarts after a failure, a minimal number of additional processes are forced to roll back with it. Our algorithms require each process to store at most two checkpoints in stable storage. This storage requirement is shown to be minimal under general assumptions.
Keywords :
Checkpoint; consistent state; distributed systems; fault-tolerance; rollback-recovery; Checkpointing; Computer science; Distributed algorithms; Distributed computing; Fault tolerance; Fault tolerant systems; Hardware; Resumes;
fLanguage :
English
Journal_Title :
Software Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
0098-5589
Type :
jour
DOI :
10.1109/TSE.1987.232562
Filename :
1702129