DocumentCode :
1577951
Title :
A Meta-Learning Failure Predictor for Blue Gene/L Systems
Author :
Gujrati, Prashasta ; Li, Yawei ; Lan, Zhiling ; Thakur, Rajeev ; White, John
Author_Institution :
Illinois Inst. of Technol., Chicago, IL
fYear :
2007
Firstpage :
40
Lastpage :
40
Abstract :
The demand for more computational power in science and engineering has spurred the design and deployment of ever-growing cluster systems. Even though the individual components used in these systems are highly reliable, the presence of a large number of components inevitably increases the failure probability of such systems. Successful prediction of potential failures can greatly enhance various fault tolerance mechanisms used in large clusters, thereby mitigating the adverse impact of failures on system productivity and total cost of ownership. In this paper, we present a three-phase failure predictor to automatically process RAS events and further discover failure patterns for prediction in Blue Gene/L systems. In particular, this paper explores the use of meta-learning to adaptively integrate base methods with the goal of boosting prediction accuracy. Experiments with two RAS logs collected from Blue Gene/L systems at ANL and SDSC demonstrate the effectiveness of the proposed failure predictor.
Keywords :
fault tolerance; learning (artificial intelligence); parallel machines; Blue Gene/L systems; failure probability; fault tolerance mechanisms; meta-learning failure predictor; three-phase failure predictor; Accuracy; Costs; Fault tolerant systems; High performance computing; Power engineering and energy; Power engineering computing; Power system reliability; Reliability engineering; Resilience; Supercomputers;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel Processing, 2007. ICPP 2007. International Conference on
Conference_Location :
Xi'an
ISSN :
0190-3918
Print_ISBN :
978-0-7695-2933-2
Type :
conf
DOI :
10.1109/ICPP.2007.9
Filename :
4343847