• DocumentCode
    2898316
  • Title
    Evaluating availability under quasi-heavy-tailed repair times
  • Author
    Kato, Sei ; Osogami, Takayuki
  • Author_Institution
    Tokyo Res. Lab., IBM Res., Yamato
  • fYear
    2008
  • fDate
    24-27 June 2008
  • Firstpage
    442
  • Lastpage
    451
  • Abstract
    The time required to recover from failures has a great impact on the availability of information technology (IT) systems. We define a class of probability distributions named quasi-heavy-tailed distributions as those distributions whose time series graph of the sample mean shows intermittent jumps in a given period. We find that the distribution of repair time is quasi-heavy-tailed for three IT systems: an in-house system hosted by IBM, a high performance computing system at the Los Alamos National Laboratory, and a distributed memory computer at the National Energy Research Scientific Computing Center. This means that the mean time to repair estimated by observing incidents within a certain period could dramatically change if we observe incidents successively for another period. In other words, the estimated mean time to repair has large fluctuations over time. As a result, classical metrics based on the mean time to repair are not optimal for evaluating the availability of these systems. We propose to evaluate the availability of IT systems with the T-year return value, estimated based on extreme value theory. The T-year return value refers to the value that the repair time exceeds on average once every estimated T years. We find that the T-year return value is a sound metric of the availability of the three IT systems.
  • Keywords
    software maintenance; statistical distributions; system recovery; Los Alamos National Laboratory; National Energy Research Scientific Computing Center; distributed memory computer; extreme value theory; information technology; intermittent jumps; quasiheavy-tailed repair times; Distributed computing; Fluctuations; High performance computing; Information technology; Laboratories; Probability distribution; Robustness; Scientific computing; Statistical distributions; Time measurement;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks With FTCS and DCC, 2008. DSN 2008. IEEE International Conference on
  • Conference_Location
    Anchorage, AK
  • Print_ISBN
    978-1-4244-2397-2
  • Electronic_ISBN
    978-1-4244-2398-9
  • Type
    conf
  • DOI
    10.1109/DSN.2008.4630115
  • Filename
    4630115
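
The two quantities named in the abstract, the running sample mean of repair times and the T-year return value, can be illustrated with a minimal Python sketch. This is not the authors' method: the abstract says only that the return value is estimated with extreme value theory, so the peaks-over-threshold approach, the method-of-moments fit of a generalized Pareto distribution, and the names `running_mean`, `return_value`, `threshold`, and `years_observed` are all assumptions made for illustration.

```python
import math
from statistics import mean, pvariance


def running_mean(samples):
    """Return the sample mean after each successive observation."""
    total = 0.0
    means = []
    for n, x in enumerate(samples, start=1):
        total += x
        means.append(total / n)
    return means


def return_value(repair_times, threshold, years_observed, t_years):
    """Estimate the value that repair time exceeds on average once per t_years.

    Assumption (not from the source): excesses over `threshold` follow a
    generalized Pareto distribution fitted by the method of moments, which
    is only meaningful when the fitted shape is below 0.5.
    """
    excesses = [x - threshold for x in repair_times if x > threshold]
    if len(excesses) < 2:
        raise ValueError("need at least two exceedances of the threshold")
    m = mean(excesses)
    v = pvariance(excesses)
    if v == 0:
        raise ValueError("excesses have zero variance")
    ratio = m * m / v
    shape = 0.5 * (1.0 - ratio)       # GPD shape parameter (xi)
    scale = 0.5 * m * (1.0 + ratio)   # GPD scale parameter (sigma)
    # Expected number of threshold exceedances in t_years.
    expected = len(excesses) / years_observed * t_years
    if abs(shape) < 1e-9:
        # Limiting case of an exponential tail.
        return threshold + scale * math.log(expected)
    return threshold + (scale / shape) * (expected ** shape - 1.0)
```

A single long repair makes the running mean jump, which is the behaviour the abstract describes: `running_mean([1, 1, 1, 97])` ends at 25.0 after three values of 1.0. The return value, by contrast, is a tail quantile and grows with the horizon T rather than being dominated by the most recent incident. The choice of threshold is left to the caller and strongly affects the estimate.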