• DocumentCode
    1348656
  • Title

    Reliability of a System of k Nodes for High Performance Computing Applications

  • Author

    Gottumukkala, Narasimha Raju ; Nassar, Raja ; Paun, Mihaela ; Leangsuksun, Chokchai Box ; Scott, Stephen L.

  • Volume
    59
  • Issue
    1
  • fYear
    2010
  • fDate
    3/1/2010
  • Firstpage
    162
  • Lastpage
    169
  • Abstract
    Reliability estimation of High Performance Computing (HPC) systems enables resource allocation and fault tolerance frameworks to minimize the performance loss due to unexpected failures. Recent studies have shown that compute nodes in HPC systems follow a time-varying failure rate distribution, such as the Weibull, rather than the exponential distribution. In this paper, we propose a model for the Time to Failure (TTF) distribution of a system of k s-independent nodes when individual nodes exhibit time-varying failure rates. We also present the system reliability, failure rates, Mean Time to Failure (MTTF), and derivations of the proposed system TTF model. The model is validated using observed time-to-failure data.
  • Keywords
    Weibull distribution; distributed algorithms; fault tolerant computing; resource allocation; HPC systems; TTF distribution; exponential distribution; failure rate distribution; failure rates; fault tolerance; high performance computing; k independent nodes; mean time-to-failure; system reliability; time-to-failure; system time to failure
  • fLanguage
    English
  • Journal_Title
    Reliability, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9529
  • Type

    jour

  • DOI
    10.1109/TR.2009.2034291
  • Filename
    5345696