• DocumentCode
    3348564
  • Title

    Failure Detection in Large Scale Systems: a Survey

  • Author

    Pasin, Marcia ; Fontaine, Stéphane ; Bouchenak, Sara

  • Author_Institution
    Lab. de Sist. de Comput., Fed. Univ. of Santa Maria, Santa Maria
  • fYear
    2008
  • fDate
    7-11 April 2008
  • Firstpage
    165
  • Lastpage
    168
  • Abstract
    Failure detection is a basic service for building dependable systems. The large-scale distribution of computing systems makes failure detectors much harder to build. Moreover, providing QoS (quality of service) guarantees in this context is a challenging task. The objective of this paper is twofold: (1) to propose a complete set of classification criteria for comparing failure detection mechanisms, and (2) based on these criteria, to survey the main failure detection solutions for large-scale distributed systems.
  • Keywords
    distributed processing; fault tolerant computing; large-scale systems; quality of service; QoS; failure detection; large scale distributed systems; Best practices; Condition monitoring; Context-aware services; Detectors; Distributed computing; Fault detection; Heart beat; Scalability
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Network Operations and Management Symposium Workshops, 2008. NOMS Workshops 2008. IEEE
  • Conference_Location
    Salvador da Bahia
  • Print_ISBN
    978-1-4244-2067-4
  • Type

    conf

  • DOI
    10.1109/NOMSW.2007.28
  • Filename
    4509944