• DocumentCode
    378760
  • Title

    An adaptive failure detection protocol

  • Author

    Fetzer, Christof ; Raynal, Michel ; Tronel, Frederic

  • Author_Institution
    AT&T, Florham Park, NJ, USA
  • fYear
    2001
  • fDate
    2001
  • Firstpage
    146
  • Lastpage
    153
  • Abstract
    The detection of process failures is a crucial problem that system designers have to cope with in order to build fault-tolerant distributed platforms. Unfortunately, it is impossible to distinguish with certainty a crashed process from a very slow process in a purely asynchronous distributed system. This prevents some problems from being solved in such systems, which is why failure detector oracles have been introduced to circumvent these impossibility results. The paper presents a relatively simple protocol that allows a process to "monitor" another process, and consequently to detect its crash. This protocol relies as much as possible on application messages to do this monitoring. Unlike previous process crash detection protocols, it uses control messages only when no application message is sent by the monitoring process to the observed process. When the underlying system satisfies the partial synchrony assumption, the protocol actually implements an eventually perfect failure detector (i.e., a failure detector of the class usually denoted ◇P). Moreover, if the average observed transmission delay is finite and the upper layer application terminates within a bounded number of steps for any failure detector in ◇P after the failure detector becomes "perfect", then the application also terminates correctly when run with the proposed protocol. These properties make the protocol inexpensive, implementable, and powerful. The paper also describes performance measurements of an implementation of the protocol.
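The monitoring idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the protocol from the paper, whose details the abstract does not give. It assumes that every message sent to the observed process is acknowledged, that a control message is sent only after a period with no outgoing traffic, and that a false suspicion doubles the timeout. The class `Monitor`, its fields, and the parameters `timeout` and `idle_limit` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Monitor:
    """Monitors one observed process (illustrative sketch, hypothetical names)."""
    timeout: float = 1.0       # assumed bound on the time to receive an acknowledgment
    idle_limit: float = 0.5    # silence after which a control message is sent
    suspected: bool = False
    last_sent: Optional[float] = None       # time of the last message sent
    awaiting_since: Optional[float] = None  # send time of the oldest unacknowledged message
    control_sent: int = 0                   # number of control messages used so far

    def _record_send(self, now: float) -> None:
        self.last_sent = now
        if self.awaiting_since is None:
            self.awaiting_since = now

    def send_application(self, now: float) -> None:
        # An application message doubles as a monitoring probe.
        self._record_send(now)

    def on_ack(self, now: float) -> None:
        if self.suspected:
            # The process was slow, not crashed: revoke the suspicion
            # and adapt by enlarging the timeout (assumed policy).
            self.suspected = False
            self.timeout *= 2
        self.awaiting_since = None

    def tick(self, now: float) -> bool:
        # Fall back to a control message only when no message was sent recently.
        if self.last_sent is None or now - self.last_sent >= self.idle_limit:
            self.control_sent += 1
            self._record_send(now)
        # Suspect the process if an acknowledgment is overdue.
        if self.awaiting_since is not None and now - self.awaiting_since > self.timeout:
            self.suspected = True
        return self.suspected
```

In this sketch, a monitor that keeps exchanging application messages never increments `control_sent`. Once the timeout has grown past the actual (unknown) delay bound, false suspicions stop, which is the intuition behind eventually perfect detection under partial synchrony.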
  • Keywords
    distributed processing; fault diagnosis; fault tolerant computing; protocols; system recovery; adaptive failure detection protocol; application messages; average observed transmission delay; control messages; crashed process; failure detector oracles; fault-tolerant distributed platforms; monitoring process; observed process; partial synchrony assumption; perfect failure detector; performance measurements; process crash detection protocols; process failure detection; purely asynchronous distributed system; simple protocol; system designers; upper layer application; very slow process; Computer crashes; Condition monitoring; Delay; Detectors; Fault detection; Fault tolerance; Fault tolerant systems; Measurement; Middleware; Protocols
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Computing, 2001. Proceedings. 2001 Pacific Rim International Symposium on
  • Conference_Location
    Seoul
  • Print_ISBN
    0-7695-1414-6
  • Type

    conf

  • DOI
    10.1109/PRDC.2001.992691
  • Filename
    992691