• DocumentCode
    3722947
  • Title

    Insights into the Diagnosis of System Failures from Cluster Message Logs

  • Author

    Edward Chuah;Arshad Jhumka;James C. Browne;Bill Barth;Sai Narasimhamurthy

  • Author_Institution
    Univ. of Texas at Austin, Austin, TX, USA
  • fYear
    2015
  • Firstpage
    225
  • Lastpage
    232
  • Abstract
    Large cluster systems are composed of complex, interacting hardware and software components. Components, or the interactions between them, may fail for many different reasons, leading to the eventual failure of executing jobs. This paper investigates an open question about failure diagnosis: What are the characteristics of the errors that lead to cluster system failures? To this end, the paper gives a systematic process for identifying and characterizing the root causes of failures. We applied an extended version of the FDiagV3 diagnostics toolkit to the log files of the Ranger and Lonestar supercomputers. Our results show that: (i) failures were a result of recurrent issues and errors, (ii) a small set of nodes is associated with these issues and errors, and (iii) Ranger and Lonestar display similar sets of problems. FDiagV3 will be released into the public domain in May 2015 to support failure diagnosis for large cluster systems.
  • Keywords
    "Software","Supercomputers","Production","Linux","Correlation","Electronic mail","Analytical models"
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Computing Conference (EDCC), 2015 Eleventh European
  • Type

    conf

  • DOI
    10.1109/EDCC.2015.19
  • Filename
    7371970