• DocumentCode
    2257550
  • Title

    An algorithm for distributed hierarchical diagnosis of dynamic fault and repair events

  • Author

    Duarte, Elias Procópio, Jr. ; Brawerman, Alessandro ; Albini, Luiz Carlos P

  • Author_Institution
    Dept. Inf., Fed. Univ. of Parana, Curitiba, Brazil
  • fYear
    2000
  • fDate
    2000
  • Firstpage
    299
  • Lastpage
    306
  • Abstract
    The components of a fault-tolerant distributed system must be able to accurately determine which components of the system are faulty and which are fault-free. In this paper, we present a new distributed algorithm for event diagnosis in fully-connected networks. An event is defined as a faulty node becoming fault-free, or vice versa. Previous hierarchical algorithms considered a static fault situation, in which an event can only occur after a previous event has been fully diagnosed. The new algorithm is capable of diagnosing dynamic events as long as the nodes stay in a given state for a period of time long enough for all testers to detect that state. Each node running the algorithm keeps a timestamp for the state of each other node in the system. This timestamp is implemented as a counter, which is incremented every time a node changes its state. In this way, each tester may obtain information about a given node in the system from more than one tested node without causing any inconsistencies, i.e., without taking an older state for a newer one. Nodes run a hierarchical testing strategy, which is a hypercube when all nodes are fault-free. When a fault-free node is tested, the tester gets diagnostic information about N/2 nodes for a system of N nodes. In spite of the overhead of keeping and transferring timestamps, the new algorithm significantly reduces the average latency when compared to other similar approaches, presenting a new option for practical diagnosis implementation.
  • Keywords
    distributed algorithms; fault diagnosis; fault tolerant computing; counter; distributed algorithm; distributed hierarchical diagnosis; dynamic events; dynamic fault events; dynamic repair events; event diagnosis; fault-free nodes; fault-tolerant distributed system; faulty component determination; fully-connected networks; hierarchical algorithms; hierarchical testing strategy; hypercube; latency; node state; overhead; timestamp; Adaptive systems; Counting circuits; Delay; Distributed algorithms; Event detection; Fault diagnosis; Fault tolerant systems; Informatics; Local area networks; System testing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Systems, 2000. Proceedings. Seventh International Conference on
  • Conference_Location
    Iwate
  • ISSN
    1521-9097
  • Print_ISBN
    0-7695-0568-6
  • Type

    conf

  • DOI
    10.1109/ICPADS.2000.857711
  • Filename
    857711
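
The timestamp mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary layout, and the convention that every node starts fault-free with counter 0 (so that an even counter means fault-free) are assumptions made for this example. The sketch shows how a tester might merge counters received from several tested nodes without replacing a newer state with an older one, and how the testers of a node line up with hypercube neighbors when all nodes are fault-free.

```python
from typing import Dict, List


def merge_timestamps(local: Dict[int, int], received: Dict[int, int]) -> Dict[int, int]:
    """Return a copy of `local` updated with strictly newer counters from `received`.

    Keys are node identifiers; values are state-change counters (timestamps).
    A received counter is accepted only if it is larger than the one already
    held, so an older state never overwrites a newer one.
    """
    merged = dict(local)
    for node, counter in received.items():
        if counter > merged.get(node, -1):
            merged[node] = counter
    return merged


def is_fault_free(counter: int) -> bool:
    # Assumption (not stated in the abstract): nodes start fault-free at
    # counter 0 and each event flips the state, so even means fault-free.
    return counter % 2 == 0


def hypercube_neighbors(node: int, dimensions: int) -> List[int]:
    # In a hypercube, neighbors differ from `node` in exactly one bit.
    return [node ^ (1 << d) for d in range(dimensions)]


if __name__ == "__main__":
    local_view = {1: 2, 2: 1}
    from_tested_node = {1: 1, 2: 3, 3: 0}
    view = merge_timestamps(local_view, from_tested_node)
    print(view)
    print({n: is_fault_free(c) for n, c in view.items()})
    print(hypercube_neighbors(0, 3))
```

Taking the larger counter makes the merge independent of the order in which reports arrive, which is what allows information about the same node to come from more than one tested node.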