• DocumentCode
    628260
  • Title
    Reading between the lines of failure logs: Understanding how HPC systems fail
  • Author
    El-Sayed, Nosayba ; Schroeder, Bianca
  • Author_Institution
    Dept. of Comput. Sci., Univ. of Toronto, Toronto, ON, Canada
  • fYear
    2013
  • fDate
    24-27 June 2013
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    As the component count in supercomputing installations continues to increase, system reliability is becoming one of the major issues in designing HPC systems. These issues will become more challenging in future Exascale systems, which are predicted to include millions of CPU cores. Even with relatively reliable individual components, the sheer number of components will increase failure rates to unprecedented levels. Efficiently running those systems will require a good understanding of how different factors impact system reliability. In this paper we use a decade's worth of field data made available by Los Alamos National Lab to study the impact of a diverse set of factors on the reliability of HPC systems. We provide insights into the nature of correlations between failures, and investigate the impact of factors such as power quality, temperature, fan and chiller reliability, system usage and utilization, and external factors such as cosmic radiation, on system reliability.
  • Keywords
    multiprocessing systems; parallel processing; CPU cores; Exascale systems; HPC systems fail; Los Alamos National Lab; failure logs; supercomputing installations; Analytical models; Correlation; Hardware; Probability; Program processors; Reliability
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks (DSN), 2013 43rd Annual IEEE/IFIP International Conference on
  • Conference_Location
    Budapest
  • ISSN
    1530-0889
  • Print_ISBN
    978-1-4673-6471-3
  • Type
    conf
  • DOI
    10.1109/DSN.2013.6575356
  • Filename
    6575356