Title :
A Failure Detection Service for Internet-Based Multi-AS Distributed Systems
Author :
Moraes, Dionei M. ; Duarte, Elias P.
Author_Institution :
Dept. Inf., Fed. Univ. of Parana, Curitiba, Brazil
Abstract :
Failure detectors are one of the basic building blocks of fault-tolerant distributed systems. A failure detector is a distributed oracle that provides information about the state of the processes of a distributed system. This work presents a failure detector service for Internet-based distributed systems that span multiple autonomous systems. The service is based on monitors that provide global process state information through an SNMP interface. A monitor executes on each network where processes are monitored. Monitors in different networks communicate across the Internet using Web Services. The system was implemented and evaluated for monitored processes running both on a single LAN and distributed throughout the world on PlanetLab. Experimental results are presented, showing CPU usage, failure detection latency, and mistake rate.
Keywords :
Web services; fault tolerant computing; local area networks; system recovery; CPU usage; Internet-based distributed systems; Internet-based multi-AS distributed systems; LAN; PlanetLab; SNMP interface; autonomous systems; failure detection latency; failure detection service; failure detector service; fault-tolerant distributed systems; global process state information; Biomedical monitoring; Computer crashes; Detectors; Heart beat; Monitoring; Distributed Systems Management; Failure Detectors; Multi-AS Internet Systems; Process Management
Conference_Title :
Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on
Conference_Location :
Tainan
Print_ISBN :
978-1-4577-1875-5
DOI :
10.1109/ICPADS.2011.5