• DocumentCode
    3549472
  • Title
    Ensembles of models for automated diagnosis of system performance problems
  • Author
    Zhang, Steve ; Cohen, Ira ; Goldszmidt, Moises ; Symons, Julie ; Fox, Armando
  • Author_Institution
    Stanford Univ., CA, USA
  • fYear
    2005
  • fDate
    28 June-1 July 2005
  • Firstpage
    644
  • Lastpage
    653
  • Abstract
    Violations of service level objectives (SLO) in Internet services are urgent conditions requiring immediate attention. Previously we explored (I. Cohen et al., 2004) an approach for identifying which low-level system properties were correlated to high-level SLO violations (the metric attribution problem). The approach is based on automatically inducing models from data using pattern recognition and probability modeling techniques. In this paper we extend our approach to adapt to changing workloads and external disturbances by maintaining an ensemble of probabilistic models, adding new models when existing ones do not accurately capture current system behavior. Using realistic workloads on an implemented prototype system, we show that the ensemble of models captures the performance behavior of the system accurately under changing workloads and conditions. We fuse information from the models in the ensemble to identify likely causes of the performance problem, with results comparable to those produced by an oracle that continuously changes the model based on advance knowledge of the workload. The cost of inducing new models and managing the ensembles is negligible, making our approach both immediately practical and theoretically appealing.
  • Keywords
    Internet; belief networks; fault diagnosis; fault tolerant computing; pattern recognition; probability; Bayesian model management; Internet services; automated diagnosis; probability modeling techniques; self-healing systems; self-monitoring systems; service level objectives; statistical induction; system performance; Availability; Bayesian methods; Delay; Hardware; Pattern recognition; Sensor phenomena and characterization; Sensor systems; System performance; Web and internet services; Web server; Automated diagnosis; self-healing and self-monitoring systems; statistical induction and Bayesian Model Management
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Dependable Systems and Networks, 2005. DSN 2005. Proceedings. International Conference on
  • Print_ISBN
    0-7695-2282-3
  • Type
    conf
  • DOI
    10.1109/DSN.2005.44
  • Filename
    1467838