• DocumentCode
    3601333
  • Title

    Predicting Transient Downtime in Virtual Server Systems: An Efficient Sample Path Randomization Approach

  • Author

    Du, Anna Ye ; Das, Sanjukta ; Yang, Zhouhan ; Qiao, Chunming ; Ramesh, R.

  • Author_Institution
    Dept. of Manage. Sci. & Syst., State Univ. of New York, Buffalo, NY, USA
  • Volume
    64
  • Issue
    12
  • fYear
    2015
  • Firstpage
    3541
  • Lastpage
    3554
  • Abstract
    A central challenge in developing Service Level Agreements for cloud datacenters is estimating the downtime distribution of a set of provisioned servers over a service window, a problem compounded by three facts. First, while steady-state probabilities have been derived for birth-death processes involving server failures and repairs, they can be highly inaccurate under transience. Furthermore, steady state cannot be assured under typical service windows. Therefore, estimation of transient distributions is essential. Second, the processes of failures and repairs may follow any distribution and hence need to be extracted from system log data and modeled using appropriate general distributions. Third, downtime distributions over service windows depend on the number of servers and their deployment structure for a contract. We develop an efficient and generalized sample path randomization approach to precisely estimate transient probabilities under three different checkpointing strategies and three flexible failure distribution models. The estimators are unbiased, consistent, efficient and sufficient. Their asymptotic convergence is established. The estimation algorithms are computationally efficient in solving practical problems and yield rich information on transient system behaviors. The methodology is general and extensible to various server failure and repair processes characterized using birth-death modeling.
  • Keywords
    checkpointing; cloud computing; computer centres; contracts; probability; virtualisation; asymptotic convergence; birth-death modeling; birth-death processes; checkpointing strategies; cloud datacenter service level agreements; contract deployment structure; downtime distribution estimation; flexible failure distribution models; sample path randomization approach; server failure process; server repair process; service windows; steady-state probabilities; transient distribution estimation; transient downtime prediction; transient probabilities; transient system behaviors; virtual server systems; Computational modeling; Maintenance engineering; Markov chains; Predictive models; Virtualization; fault-tolerant systems; virtual infrastructure
  • fLanguage
    English
  • Journal_Title
    Computers, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9340
  • Type

    jour

  • DOI
    10.1109/TC.2015.2394437
  • Filename
    7038208