DocumentCode :
166685
Title :
Digging deeper into cluster system logs for failure prediction and root cause diagnosis
Author :
Xiaoyu Fu ; Rui Ren ; Sally A. McKee ; Jianfeng Zhan ; Ninghui Sun
Author_Institution :
State Key Lab. Comput. Archit., Inst. of Comput. Technol., Beijing, China
fYear :
2014
fDate :
22-26 Sept. 2014
Firstpage :
103
Lastpage :
112
Abstract :
As the sizes of supercomputers and data centers grow towards exascale, failures become normal. System logs play a critical role in the increasingly complex tasks of automatic failure prediction and diagnosis. Many methods for failure prediction are based on analyzing event logs for large-scale systems, but there is still neither a widely used method that predicts failures based on both non-fatal and fatal events, nor a precise one that uses fine-grained information (such as failure type, node location, related application, and time of occurrence). A deeper and more precise log analysis technique is needed. We propose a three-step approach to draw out event dependencies and to identify failure-event generating processes. First, we cluster frequent event sequences into event groups based on common events. Then we infer causal dependencies between events in each event group. Finally, we extract failure rules based on the observation that events of the same event types, on the same nodes, or from the same applications have similar operational behaviors. We use this rich information to improve failure prediction. Our approach semi-automates diagnosing the root causes of failure events, making it a valuable tool for system administrators.
Keywords :
fault tolerant computing; parallel processing; cluster system logs; data centers; event logs; failure prediction; failure-event generating process; log analysis technique; root cause diagnosis; supercomputers; Clustering algorithms; Correlation; Data mining; Inference algorithms; Kernel; Predictive models; Semantics; event causal dependency inference; large-scale cluster systems
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cluster Computing (CLUSTER), 2014 IEEE International Conference on
Conference_Location :
Madrid
Type :
conf
DOI :
10.1109/CLUSTER.2014.6968768
Filename :
6968768