DocumentCode
3722947
Title
Insights into the Diagnosis of System Failures from Cluster Message Logs
Author
Edward Chuah;Arshad Jhumka;James C. Browne;Bill Barth;Sai Narasimhamurthy
Author_Institution
Univ. of Texas at Austin, Austin, TX, USA
fYear
2015
Firstpage
225
Lastpage
232
Abstract
Large cluster systems are composed of complex, interacting hardware and software components. Components, or the interactions between components, may fail for many different reasons, leading to the eventual failure of executing jobs. This paper investigates an open question about failure diagnosis: What are the characteristics of the errors that lead to cluster system failures? To this end, this paper presents a systematic process for identifying and characterizing the root causes of failures. We applied an extended version of the FDiagV3 diagnostics toolkit to the log files of the Ranger and Lonestar supercomputers. Our results show that: (i) failures resulted from recurrent issues and errors, (ii) a small set of nodes is associated with these issues and errors, and (iii) Ranger and Lonestar display similar sets of problems. FDiagV3 will be released into the public domain in May 2015 to support failure diagnosis for large cluster systems.
Keywords
"Software","Supercomputers","Production","Linux","Correlation","Electronic mail","Analytical models"
Publisher
IEEE
Conference_Titel
Dependable Computing Conference (EDCC), 2015 Eleventh European
Type
conf
DOI
10.1109/EDCC.2015.19
Filename
7371970