Title :
Failure Detection in Large Scale Systems: a Survey
Author :
Pasin, Marcia ; Fontaine, Stéphane ; Bouchenak, Sara
Author_Institution :
Lab. de Sist. de Comput., Fed. Univ. of Santa Maria, Santa Maria
Abstract :
Failure detection is a basic service for building dependable systems. The large-scale distribution of computing systems makes failure detectors much harder to build, and providing quality of service (QoS) guarantees in this context is a challenging task. The objective of this paper is twofold: (1) to propose a complete set of classification criteria for comparing different failure detection mechanisms, and (2) based on these criteria, to survey the main failure detection solutions for large-scale distributed systems.
Keywords :
distributed processing; fault tolerant computing; large-scale systems; quality of service; QoS; failure detection; large scale distributed systems; Best practices; Condition monitoring; Context-aware services; Detectors; Distributed computing; Fault detection; Heart beat; Scalability
Conference_Title :
Network Operations and Management Symposium Workshops, 2008. NOMS Workshops 2008. IEEE
Conference_Location :
Salvador da Bahia
Print_ISBN :
978-1-4244-2067-4
DOI :
10.1109/NOMSW.2007.28