• DocumentCode
    1379410
  • Title

    Robust monitoring of network-wide aggregates through gossiping

  • Author

    Wuhib, Fetahi ; Dam, Mads ; Stadler, Rolf ; Clem, Alexander

  • Author_Institution
    ACCESS Linnaeus Center, KTH R. Inst. of Technol., Stockholm, Sweden
  • Volume
    6
  • Issue
    2
  • fYear
    2009
  • fDate
    6/1/2009
  • Firstpage
    95
  • Lastpage
    109
  • Abstract
    We investigate the use of gossip protocols for continuous monitoring of network-wide aggregates under crash failures. Aggregates are computed from local management variables using functions such as SUM, MAX, or AVERAGE. For this type of aggregation, crash failures pose a particular challenge because of the problem of mass loss, namely, how to correctly account for contributions from nodes that have failed. In this paper we give a partial solution. We present G-GAP, a gossip protocol for continuous monitoring of aggregates, which is robust against failures that are discontiguous in the sense that neighboring nodes do not fail within a short period of each other. We give formal proofs of correctness and convergence, and we evaluate the protocol through simulation using real traces. The simulation results suggest that the design goals for this protocol have been met. For instance, the tradeoff between estimation accuracy and protocol overhead can be controlled, and the protocol achieves high estimation accuracy (an error below about 5% in our measurements), even for large networks and frequent node failures. Further, we perform a comparative assessment of G-GAP against a tree-based aggregation protocol using simulation. Surprisingly, we find that the tree-based aggregation protocol consistently outperforms the gossip protocol for comparable overhead, both in terms of accuracy and robustness.
  • Keywords
    distributed algorithms; monitoring; protocols; crash failures; gossip protocols; network-wide aggregates; robust monitoring; tree-based aggregation protocol; Aggregates; Computer crashes; Condition monitoring; Database systems; Distributed algorithms; Error correction; Fault tolerant systems; Helium; Protocols; Robustness; Gossip protocol; epidemic protocol; aggregation; real-time monitoring
  • fLanguage
    English
  • Journal_Title
    Network and Service Management, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1932-4537
  • Type

    jour

  • DOI
    10.1109/TNSM.2009.090603
  • Filename
    5374830
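
The mass-loss problem mentioned in the abstract can be illustrated with a minimal sketch of generic push-sum style gossip averaging. This is not G-GAP itself, and the record does not describe the protocol's internals; the function names (`push_sum_round`, `run`), the fully connected topology, and the synchronous rounds are all assumptions made for illustration. Each node holds a (sum, weight) pair, keeps half of it, and sends the other half to a random live peer; the local estimate of AVERAGE is sum / weight. When a node crashes, the mass it currently holds vanishes with it.

```python
import random

def push_sum_round(state, alive, rng):
    """One synchronous round: every live node keeps half of its
    (sum, weight) pair and pushes the other half to a random live peer."""
    inbox = {n: [0.0, 0.0] for n in alive}
    for n in alive:
        s, w = state[n]
        target = rng.choice([m for m in alive if m != n])
        for dest in (n, target):
            inbox[dest][0] += s / 2
            inbox[dest][1] += w / 2
    for n in alive:
        state[n] = (inbox[n][0], inbox[n][1])

def run(values, rounds, crash=None, seed=0):
    """Simulate gossip averaging over a fully connected network (assumed).

    crash is an optional (round_index, node_id) pair: the node stops
    participating from that round on and its current mass is lost.
    Returns (estimates, state) for the surviving nodes.
    """
    rng = random.Random(seed)
    state = {n: (float(v), 1.0) for n, v in enumerate(values)}
    alive = list(state)
    for r in range(rounds):
        if crash is not None and r == crash[0]:
            alive.remove(crash[1])  # crash failure: held mass disappears
        push_sum_round(state, alive, rng)
    estimates = {n: state[n][0] / state[n][1] for n in alive}
    return estimates, {n: state[n] for n in alive}

if __name__ == "__main__":
    ok, _ = run([10, 20, 30, 40], rounds=100)
    print("no failures:", ok)
    lossy, _ = run([10, 20, 30, 40], rounds=100, crash=(1, 3))
    print("node 3 crashes after one round:", lossy)
```

Without failures, the total sum and total weight in the system are conserved, so every estimate converges to the true average. After a crash, the survivors still agree on a common value, but that value is computed from whatever mass remained, which in general is neither the average over all nodes nor the average over the surviving nodes' local values. That discrepancy is the mass loss a failure-robust gossip protocol has to correct.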