Title :
SWIM: scalable weakly-consistent infection-style process group membership protocol
Author :
Das, Abhinandan ; Gupta, Indranil ; Motivala, Ashish
Author_Institution :
Dept. of Comput. Sci., Cornell Univ., Ithaca, NY, USA
Abstract :
Several distributed peer-to-peer applications require weakly-consistent knowledge of process group membership information at all participating processes. SWIM is a generic software module that offers this service for large-scale process groups. The SWIM effort is motivated by the poor scalability of traditional heart-beating protocols, which either impose network loads that grow quadratically with group size, or compromise response times or false positive frequency with respect to detecting process crashes. This paper reports on the design, implementation and performance of the SWIM sub-system on a large cluster of commodity PCs. Unlike traditional heart-beating protocols, SWIM separates the failure detection and membership update dissemination functionalities of the membership protocol. Processes are monitored through an efficient peer-to-peer periodic randomized probing protocol. Neither the expected time to first detection of each process failure nor the expected message load per member varies with group size. Information about membership changes, such as process joins, drop-outs and failures, is propagated via piggybacking on ping messages and acknowledgments. This results in a robust and fast infection-style (also called epidemic or gossip-style) dissemination. The rate of false failure detections in the SWIM system is reduced by modifying the protocol to allow group members to suspect a process before declaring it failed; this allows the system to discover and rectify false failure detections. Finally, the protocol guarantees a deterministic time bound to detect failures. Experimental results from the SWIM prototype are presented. We discuss the extensibility of the design to a WAN-wide scale.
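The abstract's claim that the expected time to first detection does not vary with group size can be illustrated with a small simulation. This is a minimal sketch and not the authors' implementation. It assumes a simplified model in which every live member pings one uniformly random other member per protocol period, and it counts periods until the crashed member is first chosen as a ping target. The function names, the trial count and the seed are hypothetical choices made for illustration.

```python
import random


def periods_until_first_probe(n, rng):
    """Count protocol periods until some live member pings the crashed one.

    Assumed model (not taken from the record): member 0 has crashed,
    members 1..n-1 are alive, and each live member picks one ping target
    uniformly at random among the other n-1 members every period.
    """
    periods = 0
    while True:
        periods += 1
        for prober in range(1, n):
            # Draw from the n-1 members other than the prober itself.
            target = rng.randrange(n - 1)
            if target >= prober:
                target += 1  # shift past the prober's own index
            if target == 0:
                return periods  # the crashed member was probed


def mean_detection_periods(n, trials=2000, seed=1):
    """Average number of periods to first probe of the crashed member."""
    rng = random.Random(seed)
    total = sum(periods_until_first_probe(n, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for group_size in (10, 50, 200):
        print(group_size, round(mean_detection_periods(group_size), 2))
```

Under this model the chance that the crashed member is probed in a given period is 1 - (1 - 1/(n-1))^(n-1), which stays close to 1 - 1/e as n grows, so the average number of periods stays roughly constant across group sizes. Each member also sends a fixed number of pings per period, which is consistent with the per-member message load not growing with the group.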
Keywords :
computer network reliability; protocols; wide area networks; workstation clusters; PC cluster; SWIM; WAN; deterministic time bound; distributed peer-to-peer applications; experimental results; failure detection; generic software module; heart-beating protocol; membership update; network loads; performance; periodic randomized probing protocol; piggybacking; ping messages; process group membership protocol; response times; scalable weakly-consistent infection-style protocol; Application software; Computer crashes; Condition monitoring; Delay; Frequency; Large-scale systems; Peer to peer computing; Personal communication networks; Protocols; Robustness;
Conference_Title :
Dependable Systems and Networks, 2002. DSN 2002. Proceedings. International Conference on
Print_ISBN :
0-7695-1101-5
DOI :
10.1109/DSN.2002.1028914