DocumentCode
524656
Title
Proactive Failure Management for High Availability Computing in Computer Clusters
Author
Zhang, Ziming ; Fu, Song
Author_Institution
Dept. of Comput. Sci. & Eng., New Mexico Inst. of Min. & Technol., Socorro, NM, USA
Volume
1
fYear
2010
fDate
28-31 May 2010
Firstpage
377
Lastpage
381
Abstract
In this paper, we propose a framework for autonomic failure management with hierarchical failure prediction functionality for coalition clusters. It analyzes node, cluster, and system-wide failure behaviors and forecasts prospective failure occurrences based on quantified failure dynamics. The predictor also inspects failure correlations. Experimental results from a campus computational grid show that the offline and online predictions made by our predictors accurately forecast the failure trend and capture failure correlations in a coalition cluster environment.
Keywords
Availability; Conference management; Data analysis; Engineering management; Failure analysis; Grid computing; Large-scale systems; Performance analysis; Resource management; Technology management;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computational Science and Optimization (CSO), 2010 Third International Joint Conference on
Conference_Location
Huangshan, Anhui, China
Print_ISBN
978-1-4244-6812-6
Electronic_ISBN
978-1-4244-6813-3
Type
conf
DOI
10.1109/CSO.2010.71
Filename
5533049