Title :
Practical online failure prediction for Blue Gene/P: Period-based vs event-driven
Author :
Yu, Li ; Zheng, Ziming ; Lan, Zhiling ; Coghlan, Susan
Author_Institution :
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL, USA
Abstract :
To facilitate proactive fault management in large-scale systems such as IBM Blue Gene/P, online failure prediction is of paramount importance. While many techniques have been presented for online failure prediction, questions arise regarding two commonly used approaches: period-based and event-driven. Which one has better accuracy? What is the best observation window (i.e., the time interval used to collect evidence before making a prediction)? How does the lead time (i.e., the time interval from the prediction to the failure occurrence) impact prediction accuracy? To answer these questions, we analyze and compare period-based and event-driven prediction approaches via a Bayesian prediction model. We evaluate these prediction approaches, under a variety of testing parameters, by means of RAS logs collected from a production supercomputer at Argonne National Laboratory. Experimental results show that the period-based Bayesian model and the event-driven Bayesian model can achieve up to 65.0% and 83.8% prediction accuracy, respectively. Furthermore, our sensitivity study indicates that the event-driven approach seems more suitable for proactive fault management in large-scale systems like Blue Gene/P.
Keywords :
belief networks; fault tolerant computing; large-scale systems; prediction theory; system recovery; Argonne National Laboratory; IBM Blue Gene/P; RAS logs; event driven prediction approach; event-driven Bayesian model; large-scale systems; period based prediction approach; period-based Bayesian model; practical online failure prediction; proactive fault management; production supercomputer; Accuracy; Bayesian methods; Correlation; Laboratories; Large-scale systems; Monitoring; Random variables
Conference_Title :
Dependable Systems and Networks Workshops (DSN-W), 2011 IEEE/IFIP 41st International Conference on
Conference_Location :
Hong Kong
Print_ISBN :
978-1-4577-0374-4
Electronic_ISBN :
978-1-4577-0373-7
DOI :
10.1109/DSNW.2011.5958823