DocumentCode
2257550
Title
An algorithm for distributed hierarchical diagnosis of dynamic fault and repair events
Author
Duarte, Elias Procópio, Jr. ; Brawerman, Alessandro ; Albini, Luiz Carlos P
Author_Institution
Dept. Inf., Fed. Univ. of Parana, Curitiba, Brazil
fYear
2000
fDate
2000
Firstpage
299
Lastpage
306
Abstract
The components of a fault-tolerant distributed system must be able to determine accurately which components of the system are faulty and which are fault-free. In this paper, we present a new distributed algorithm for event diagnosis in fully-connected networks. An event is defined as a faulty node becoming fault-free, or vice versa. Previous hierarchical algorithms assumed a static fault situation, in which an event can occur only after the previous event has been fully diagnosed. The new algorithm can diagnose dynamic events as long as each node stays in a given state long enough for all testers to detect that state. Each node running the algorithm keeps a timestamp for the state of every other node in the system. This timestamp is implemented as a counter, which is incremented every time a node changes its state. In this way, each tester may obtain information about a given node from more than one tested node without causing inconsistencies, i.e., without mistaking an older state for a newer one. Nodes run a hierarchical testing strategy, which forms a hypercube when all nodes are fault-free. When a fault-free node is tested, the tester gets diagnostic information about N/2 nodes for a system of N nodes. In spite of the overhead of keeping and transferring timestamps, the new algorithm significantly reduces the average latency when compared to other similar approaches, presenting a new option for practical diagnosis implementation.
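The timestamp mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `merge_timestamps` and `is_fault_free` are hypothetical, and the sketch assumes that every node starts fault-free with a counter of 0, so that an even counter denotes a fault-free node and an odd counter a faulty one. Under that assumption, a tester that receives information about the same node from several tested nodes simply keeps the largest counter, which is how an older state is never taken for a newer one.

```python
from typing import Dict


def merge_timestamps(local: Dict[int, int], received: Dict[int, int]) -> Dict[int, int]:
    """Merge diagnostic information received from a tested node.

    Both arguments map a node id to its state counter (timestamp).
    For each node, the larger counter wins, because a counter only
    grows when the node changes state.
    """
    merged = dict(local)
    for node, counter in received.items():
        # Unknown nodes are treated as older than any received counter.
        if counter > merged.get(node, -1):
            merged[node] = counter
    return merged


def is_fault_free(counter: int) -> bool:
    """Assumption of this sketch: nodes start fault-free at counter 0,
    so an even counter means fault-free and an odd one means faulty."""
    return counter % 2 == 0


# A tester's current view, plus information obtained from two tested nodes.
local_view = {1: 0, 2: 3, 3: 2}
from_node_a = {1: 1, 2: 2}   # newer for node 1, stale for node 2
from_node_b = {2: 4, 3: 1}   # newer for node 2, stale for node 3

view = merge_timestamps(merge_timestamps(local_view, from_node_a), from_node_b)
```

Because taking the maximum is commutative and associative, the order in which the two reports are merged does not change the resulting view in this sketch.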
Keywords
distributed algorithms; fault diagnosis; fault tolerant computing; counter; distributed algorithm; distributed hierarchical diagnosis; dynamic events; dynamic fault events; dynamic repair events; event diagnosis; fault-free nodes; fault-tolerant distributed system; faulty component determination; fully-connected networks; hierarchical algorithms; hierarchical testing strategy; hypercube; latency; node state; overhead; timestamp; Adaptive systems; Counting circuits; Delay; Distributed algorithms; Event detection; Fault diagnosis; Fault tolerant systems; Informatics; Local area networks; System testing;
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel and Distributed Systems, 2000. Proceedings. Seventh International Conference on
Conference_Location
Iwate
ISSN
1521-9097
Print_ISBN
0-7695-0568-6
Type
conf
DOI
10.1109/ICPADS.2000.857711
Filename
857711