DocumentCode :
1997023
Title :
A Distributed Approach to Autonomous Fault Treatment in Spread
Author :
Meling, Hein ; Gilje, Joakim L.
Author_Institution :
Dept. of Electr. Eng. & Comput. Sci., Stavanger Univ., Stavanger
fYear :
2008
fDate :
7-9 May 2008
Firstpage :
46
Lastpage :
55
Abstract :
This paper presents the design and implementation of the distributed autonomous replication management (DARM) framework built on top of the Spread group communication system. The objective of DARM is to improve the dependability characteristics of systems through a fault treatment mechanism. Unlike many existing fault tolerance frameworks, DARM focuses on deployment and operational aspects, where the gain in terms of improved dependability is likely to be the greatest. DARM is novel in that recovery decisions are distributed to each individual group deployed in the system, eliminating the need for a centralized manager with global information about all groups. This scheme allows groups to perform fault treatment on themselves. A group leader in each group is responsible for fault treatment by means of replacing failed group members; the approach also tolerates failure of the group leader. The advantages of the distributed approach are: (i) no need to maintain globally centralized information about all groups, which is costly and limits scalability; (ii) reduced infrastructure complexity; and (iii) less communication overhead. We evaluate the approach experimentally to validate its fault handling capability, measuring the recovery performance of a system deployed in a local area network. The results show that applications can recover to their initial system configuration in a very short period of time.
Keywords :
decision making; distributed processing; fault tolerant computing; system recovery; DARM; decision making; dependability characteristics; distributed autonomous replication management; fault handling capability; fault tolerance; fault treatment; local area network; spread group communication system; system recovery; Conference management; Distributed computing; Engineering management; Fault tolerance; Humans; Information management; Libraries; Local area networks; Scalability; Software maintenance; Distributed replication management; Group communication; Partition measurements; Recovery management;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Dependable Computing Conference, 2008. EDCC 2008. Seventh European
Conference_Location :
Kaunas
Print_ISBN :
978-0-7695-3138-0
Type :
conf
DOI :
10.1109/EDCC-7.2008.12
Filename :
4555989