Title :
Adaptive Failure Detection via Heartbeat under Hadoop
Author :
Zhu, Hao ; Chen, Haopeng
Author_Institution :
Sch. of Software, Shanghai Jiao Tong Univ., Shanghai, China
Abstract :
Hadoop has become a popular framework for processing massive data sets on large-scale clusters. However, we observe that the detection of a failed worker is delayed, which can significantly increase the completion time of jobs with different workloads. To address this, we present two mechanisms, the Adaptive Interval and the Reputation-based Detector, which enable Hadoop to detect a failed worker in the shortest time. The Adaptive Interval dynamically configures the expiration time so that it adapts to the job size. The Reputation-based Detector evaluates the reputation of each worker; once a worker's reputation drops below a threshold, that worker is considered failed. Our experiments demonstrate that both strategies greatly improve the detection of failed workers. Specifically, the Adaptive Interval performs relatively better for small jobs, while the Reputation-based Detector is more suitable for large jobs.
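Illustration (not part of the original record) :
The abstract only outlines the two mechanisms, so the following is a minimal, illustrative Java sketch of how they could work: an expiration time that scales with job size, and a per-worker reputation score that rises with on-time heartbeats and falls with missed ones, with the worker declared failed below a threshold. All class names, constants (bounds, increment, penalty, threshold) and the linear scaling rule are assumptions made for illustration, not the paper's actual parameters or implementation.

import java.util.HashMap;
import java.util.Map;

// Sketch of the two ideas described in the abstract; all numeric values are assumed.
public class AdaptiveFailureDetectionSketch {

    // Adaptive Interval: scale the heartbeat expiration time with job size.
    // Assumed linear scaling clamped between a lower and an upper bound
    // (the upper bound mirrors Hadoop's classic 10-minute default).
    static long adaptiveExpiryMs(int totalTasks) {
        long minExpiryMs = 30_000;                            // assumed floor for small jobs
        long maxExpiryMs = 600_000;                           // assumed ceiling (10 minutes)
        long scaled = minExpiryMs + (long) totalTasks * 500;  // assumed 0.5 s per task
        return Math.max(minExpiryMs, Math.min(maxExpiryMs, scaled));
    }

    // Reputation-based Detector: score each worker by its heartbeat behaviour
    // and declare it failed once the score drops below a threshold.
    static class ReputationDetector {
        private final Map<String, Double> reputation = new HashMap<>();
        private final double threshold;

        ReputationDetector(double threshold) {
            this.threshold = threshold;
        }

        void onHeartbeatReceived(String workerId) {
            // Reward an on-time heartbeat, capped at a reputation of 1.0.
            double current = reputation.getOrDefault(workerId, 1.0);
            reputation.put(workerId, Math.min(1.0, current + 0.1));
        }

        void onHeartbeatMissed(String workerId) {
            // Penalise a missed heartbeat more heavily than a reward.
            double current = reputation.getOrDefault(workerId, 1.0);
            reputation.put(workerId, current - 0.3);
        }

        boolean isFailed(String workerId) {
            return reputation.getOrDefault(workerId, 1.0) < threshold;
        }
    }

    public static void main(String[] args) {
        System.out.println("Expiry for a 20-task job:   " + adaptiveExpiryMs(20) + " ms");
        System.out.println("Expiry for a 5000-task job: " + adaptiveExpiryMs(5000) + " ms");

        ReputationDetector detector = new ReputationDetector(0.5);
        detector.onHeartbeatReceived("worker-1");
        detector.onHeartbeatMissed("worker-2");
        detector.onHeartbeatMissed("worker-2");
        System.out.println("worker-1 failed? " + detector.isFailed("worker-1")); // false
        System.out.println("worker-2 failed? " + detector.isFailed("worker-2")); // true
    }
}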
Keywords :
distributed programming; software fault tolerance; Hadoop; adaptive failure detection; adaptive interval; failed worker; job size; large scale cluster; massive data set; reputation-based detector; Detectors; Educational institutions; Fault tolerance; Fault tolerant systems; Heart beat; Heart rate variability; Runtime; Cloud computing; Hadoop; MapReduce; adaptive heartbeat; failure detection;
Conference_Title :
Services Computing Conference (APSCC), 2011 IEEE Asia-Pacific
Conference_Location :
Jeju Island
Print_ISBN :
978-1-4673-0206-7
DOI :
10.1109/APSCC.2011.46