• DocumentCode
    2052451
  • Title
    OVIS: a tool for intelligent, real-time monitoring of computational clusters
  • Author
    Brandt, J.M. ; Gentile, A.C. ; Hale, D.J. ; Pébay, P.P.

  • Author_Institution
    Sandia Nat. Labs., Livermore, CA
  • fYear
    2006
  • fDate
    25-29 April 2006
  • Abstract
    Traditional cluster monitoring approaches consider nodes in singleton, using manufacturer-specified extreme limits as thresholds for failure "prediction". We have developed a tool, OVIS, for monitoring and analysis of large computational platforms which, instead, uses a statistical approach to characterize single device behaviors from those of a large number of statistically similar devices. Baseline capabilities of OVIS include the visual display of deterministic information about state variables (e.g., temperature, CPU utilization, fan speed) and their aggregate statistics. Visual consideration of the cluster as a comparative ensemble, rather than as singleton nodes, is an easy and useful method for tuning cluster configuration and determining effects of real-time changes. Additionally, OVIS incorporates a novel Bayesian inference scheme to dynamically infer models for the normal behavior of a system and to determine bounds on the probability of values evinced in the system. Individual node values that are unlikely given the current applicable model are flagged as aberrant. This can be a much earlier indicator of problems than waiting for the crossing of some threshold that is necessarily set high to preclude too many false alarms. We present OVIS and discuss its applications in cluster configuration and environmental tuning and to abnormality and problem discovery in our production clusters.
  • Keywords
    belief networks; inference mechanisms; system monitoring; workstation clusters; Bayesian inference; OVIS; abnormality detection; cluster monitoring software; computational clusters; large computational platforms; single device behaviors; software tool; Aggregates; Bayesian methods; Computational intelligence; Computer aided manufacturing; Condition monitoring; Displays; Laboratories; Statistical analysis; Temperature distribution; US Department of Energy; RAS; cluster monitoring
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International
  • Conference_Location
    Rhodes Island
  • Print_ISBN
    1-4244-0054-6
  • Type
    conf
  • DOI
    10.1109/IPDPS.2006.1639698
  • Filename
    1639698
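
The abstract's central idea, judging one node against a model built from many statistically similar nodes instead of against a fixed manufacturer limit, can be illustrated with a small sketch. This is not the Bayesian inference scheme used in OVIS, which the record does not specify. It swaps in a plain normal fit to the ensemble and a two-sided tail probability. The function name `flag_aberrant`, the parameter `max_tail_prob`, and the temperature readings are all hypothetical.

```python
import math
import statistics


def flag_aberrant(readings, max_tail_prob=0.01):
    """Return {node: tail_probability} for unlikely readings.

    Assumption: a single normal distribution fitted to all nodes stands in
    for the "current applicable model"; OVIS itself infers its models with
    a Bayesian scheme that is not reproduced here.
    """
    values = list(readings.values())
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    flagged = {}
    if stdev == 0:
        return flagged  # all nodes identical, nothing stands out
    for node, value in readings.items():
        z = abs(value - mean) / stdev
        # Two-sided probability of a deviation at least this large
        # under the fitted normal model.
        tail = math.erfc(z / math.sqrt(2))
        if tail < max_tail_prob:
            flagged[node] = tail
    return flagged


# Hypothetical CPU temperatures (degrees C) for 20 similar nodes.
temps = {f"node{i:02d}": 40.0 + 0.5 * (i % 5) for i in range(20)}
temps["node07"] = 55.0  # far from its peers, yet below a typical hard limit

flagged = flag_aberrant(temps)
print(sorted(flagged))
```

In this sketch node07 is flagged at 55 degrees even though a fixed limit set high enough to avoid false alarms might never trigger at that value, which is the early-warning effect the abstract describes. A single outlier also inflates the fitted standard deviation, so a practical version would likely prefer a more robust model.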