Title :
Dependable initialization of large-scale distributed software
Author :
Ren, Yansong Jennifer ; Buskens, Rick ; Gonzalez, Oscar
Author_Institution :
Bell Labs, Lucent Technologies, Murray Hill, NJ, USA
Date :
28 June-1 July 2004
Abstract :
Most documented efforts in fault-tolerant computing address the problem of recovering from failures that occur during normal system operation. Before a system can begin performing its duties, however, it must first complete initialization successfully. Large-scale distributed systems may take hours to initialize. For such systems, a key challenge is tolerating failures that occur during initialization while still completing initialization in a timely manner. In this paper, we present a dependable initialization model that captures the architecture of the system to be initialized, as well as the interdependencies among system components. We show that overall system initialization may sometimes complete more quickly if recovery actions are deferred rather than started as soon as a failure is detected. This observation leads us to introduce a recovery decision function that dynamically assesses when to take recovery actions. We then describe a dependable initialization algorithm that combines the dependable initialization model and the recovery decision function to achieve fast initialization. Experimental results show that our algorithm incurs lower initialization overhead than a conventional initialization algorithm. To our knowledge, this is the first work that formally studies the challenges of initializing a distributed system in the presence of failures.
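The claim that deferring recovery can shorten overall initialization can be illustrated with a toy cost model. The sketch below is not the paper's recovery decision function. It assumes a hypothetical scenario in which recovering immediately forces a whole group of components to restart, while deferring lets the healthy siblings finish before the failed component is re-initialized alone. All names and the cost model itself are assumptions made for illustration.

```python
def estimate_recover_now(sibling_total: float, failed_total: float) -> float:
    """Assumed cost of immediate recovery: the whole group restarts in
    parallel, so siblings lose their progress and the group finishes when
    its slowest member does."""
    return max(sibling_total, failed_total)


def estimate_defer(sibling_elapsed: float, sibling_total: float,
                   failed_total: float) -> float:
    """Assumed cost of deferred recovery: siblings finish their remaining
    work first, then the failed component is re-initialized on its own."""
    return (sibling_total - sibling_elapsed) + failed_total


def recovery_decision(sibling_elapsed: float, sibling_total: float,
                      failed_total: float) -> str:
    """Hypothetical decision function: pick the option with the smaller
    estimated remaining initialization time (in seconds)."""
    now = estimate_recover_now(sibling_total, failed_total)
    later = estimate_defer(sibling_elapsed, sibling_total, failed_total)
    return "defer" if later < now else "recover_now"


if __name__ == "__main__":
    # Siblings are nearly done (50 of 60 s); the failed component needs 5 s.
    print(recovery_decision(50.0, 60.0, 5.0))
    # Siblings have barely started (5 of 60 s); the failed one needs 30 s.
    print(recovery_decision(5.0, 60.0, 30.0))
```

Under these assumptions, deferral wins when the siblings are close to finishing, because restarting the group would discard most of their completed work. Immediate recovery wins when little progress would be lost.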
Keywords :
distributed processing; fault tolerant computing; large-scale systems; system recovery; distributed software; failure detection; failure recovery; initialization overhead; recovery decision function; Checkpointing; Communication channels; Databases; Fault tolerance; Fault tolerant systems; Grid computing
Conference_Title :
Dependable Systems and Networks, 2004 International Conference on
Print_ISBN :
0-7695-2052-9
DOI :
10.1109/DSN.2004.1311903