Title :
Proactive Failure Management for High Availability Computing in Computer Clusters
Author :
Zhang, Ziming ; Fu, Song
Author_Institution :
Dept. of Comput. Sci. & Eng., New Mexico Inst. of Min. & Technol., Socorro, NM, USA
Abstract :
In this paper, we propose a framework for autonomic failure management with hierarchical failure prediction functionality for coalition clusters. It analyzes node-, cluster-, and system-wide failure behaviors and forecasts prospective failure occurrences based on quantified failure dynamics. The predictor also inspects failure correlations. Experimental results from a campus computational grid show that the offline and online predictions by our predictors accurately forecast the failure trend and capture failure correlations in a coalition-cluster environment.
Keywords :
Availability; Conference management; Data analysis; Engineering management; Failure analysis; Grid computing; Large-scale systems; Performance analysis; Resource management; Technology management;
Conference_Title :
Computational Science and Optimization (CSO), 2010 Third International Joint Conference on
Conference_Location :
Huangshan, Anhui, China
Print_ISBN :
978-1-4244-6812-6
Electronic_ISBN :
978-1-4244-6813-3
DOI :
10.1109/CSO.2010.71