DocumentCode :
3183802
Title :
On node state reconstruction for fault tolerant distributed algorithms
Author :
Okun, Michael ; Barak, Amnon
Author_Institution :
Comput. Sci. Inst., Hebrew Univ., Jerusalem, Israel
fYear :
2002
fDate :
2002
Firstpage :
160
Lastpage :
168
Abstract :
One of the main methods for achieving fault tolerance in distributed systems is recovery of the state of failed components. Though generic recovery methods like checkpointing and message logging exist, in many cases the recovery has to be application-specific. In this paper we propose a general model for node state reconstruction after crash failures. In our model the reconstruction operation is defined only by the requirements it fulfills, without reference to the specific, application-dependent way it is performed. The model provides a framework for formal treatment of algorithm-specific and system-specific recovery procedures. It is used to specify node state reconstruction procedures for several widely used distributed algorithms and systems, as well as to prove their correctness.
Keywords :
distributed algorithms; software fault tolerance; system recovery; checkpointing; distributed algorithms; distributed systems; fault tolerance; message logging; node state reconstruction; recovery; state reconstruction; Algorithm design and analysis; Checkpointing; Computer crashes; Computer science; Distributed algorithms; Fault tolerance; Fault tolerant systems; Hardware; Software algorithms; Switches;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Reliable Distributed Systems, 2002. Proceedings. 21st IEEE Symposium on
ISSN :
1060-9857
Print_ISBN :
0-7695-1659-9
Type :
conf
DOI :
10.1109/RELDIS.2002.1180184
Filename :
1180184