Title :
A proactive fault-detection mechanism in large-scale cluster systems
Author :
Linping, Wu ; Dan, Meng ; Wen, Gao ; Jianfeng, Zhan
Author_Institution :
Inst. of Comput. Technol., Chinese Acad. of Sci., Beijing
Abstract :
To improve the overall dependability of large-scale cluster systems, this paper proposes an online fault-detection mechanism. The mechanism detects faults before a node fails, which enables proactive fault management. It works in two steps. First, the dynamic characteristics of the cluster system during normal operation are modeled using time series analysis methods. Second, fault detection is performed by comparing the current running state of the cluster system with this normal running model. A fault alarm is raised as soon as the current running state deviates from the normal running model. Experimental results show that the mechanism detects faults in the cluster system in a timely manner.
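The two steps described in the abstract can be illustrated with a minimal sketch. The abstract does not say which time series model or alarm rule the authors use, so everything specific here is an assumption made for illustration: a first-order autoregressive (AR(1)) model fitted by least squares, a 3-sigma residual threshold, and the hypothetical names `fit_ar1` and `is_faulty`.

```python
from statistics import mean, pstdev


def fit_ar1(series):
    """Fit x[t] = c + phi * x[t-1] by least squares on fault-free samples.

    Assumption: AR(1) stands in for the unspecified time series model.
    Returns (c, phi, sigma), where sigma is the residual standard deviation.
    """
    x_prev, x_next = series[:-1], series[1:]
    m_prev, m_next = mean(x_prev), mean(x_next)
    var = sum((a - m_prev) ** 2 for a in x_prev)
    cov = sum((a - m_prev) * (b - m_next) for a, b in zip(x_prev, x_next))
    phi = cov / var if var else 0.0
    c = m_next - phi * m_prev
    residuals = [b - (c + phi * a) for a, b in zip(x_prev, x_next)]
    return c, phi, pstdev(residuals)


def is_faulty(model, previous, current, k=3.0):
    """Raise an alarm when the observation deviates from the normal model.

    Assumption: "deviates" is taken to mean a prediction error larger
    than k residual standard deviations (k = 3 is an arbitrary choice).
    """
    c, phi, sigma = model
    predicted = c + phi * previous
    return abs(current - predicted) > k * sigma


# Hypothetical CPU-load samples (percent) collected during normal operation.
normal_load = [50, 52, 49, 51, 50, 53, 48, 51, 50, 52]
model = fit_ar1(normal_load)

print(is_faulty(model, previous=50, current=51))  # close to the model's prediction
print(is_faulty(model, previous=50, current=95))  # far from the model's prediction
```

In a real cluster the monitored metric, the model order, and the threshold would all need to be chosen from measured data; the sketch only shows the shape of the "build a normal model, then compare" loop.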
Keywords :
fault tolerant computing; telecommunication network management; time series; workstation clusters; fault alarm decision; large-scale cluster systems; online fault detection mechanism; proactive fault management; proactive fault-detection mechanism; time series analysis; Aging; Computers; Fault detection; Hard disks; Large-scale systems; Monitoring; Operating systems; Power system management; Testing; Time series analysis;
Conference_Title :
Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International
Conference_Location :
Rhodes Island
Print_ISBN :
1-4244-0054-6
DOI :
10.1109/IPDPS.2006.1639332