Title :
OVIS: a tool for intelligent, real-time monitoring of computational clusters
Author :
Brandt, J.M. ; Gentile, A.C. ; Hale, D.J. ; Pébay, P.P.
Author_Institution :
Sandia Nat. Labs., Livermore, CA
Abstract :
Traditional cluster monitoring approaches consider nodes in isolation, using manufacturer-specified extreme limits as thresholds for failure "prediction". We have developed a tool, OVIS, for monitoring and analysis of large computational platforms which instead uses a statistical approach to characterize the behavior of a single device relative to that of a large number of statistically similar devices. Baseline capabilities of OVIS include the visual display of deterministic information about state variables (e.g., temperature, CPU utilization, fan speed) and their aggregate statistics. Visual consideration of the cluster as a comparative ensemble, rather than as singleton nodes, is an easy and useful method for tuning cluster configuration and determining the effects of real-time changes. Additionally, OVIS incorporates a novel Bayesian inference scheme to dynamically infer models for the normal behavior of a system and to determine bounds on the probability of values evinced in the system. Individual node values that are unlikely given the currently applicable model are flagged as aberrant. This can indicate problems much earlier than waiting for a value to cross a threshold that is necessarily set high to avoid too many false alarms. We present OVIS and discuss its applications to cluster configuration and environmental tuning and to abnormality and problem discovery in our production clusters.
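The ensemble idea in the abstract can be illustrated with a minimal sketch. This is not the OVIS Bayesian inference scheme, which the abstract does not specify; it is a much simpler z-score stand-in that flags a node whose reading is unlikely relative to its peers. The function name flag_aberrant, the z_threshold parameter, the node names, and the temperature readings are all hypothetical.

```python
from statistics import mean, stdev


def flag_aberrant(readings, z_threshold=3.0):
    """Return the names of nodes whose reading lies more than z_threshold
    sample standard deviations from the ensemble mean.

    Simplified stand-in for illustration only; OVIS itself infers a
    Bayesian model of normal behavior, which is not reproduced here.
    """
    values = list(readings.values())
    if len(values) < 2:
        return []  # a sample standard deviation needs at least two nodes
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all nodes agree, so nothing stands out
    return sorted(
        name for name, value in readings.items()
        if abs(value - mu) / sigma > z_threshold
    )


# Hypothetical temperatures (degrees C): 19 similar nodes and one warm node.
readings = {f"n{i:02d}": 40.0 + (i % 3 - 1) * 0.5 for i in range(19)}
readings["n19"] = 55.0

print(flag_aberrant(readings))  # prints ['n19']
```

In this made-up data the warm node stands out against its peers at 55 degrees, a value that a fixed limit set much higher to avoid false alarms would not report.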
Keywords :
belief networks; inference mechanisms; system monitoring; workstation clusters; Bayesian inference; OVIS; abnormality detection; cluster monitoring software; computational clusters; large computational platforms; single device behaviors; software tool; Aggregates; Bayesian methods; Computational intelligence; Computer aided manufacturing; Condition monitoring; Displays; Laboratories; Statistical analysis; Temperature distribution; US Department of Energy; RAS; cluster monitoring
Conference_Title :
Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International
Conference_Location :
Rhodes Island, Greece
Print_ISBN :
1-4244-0054-6
DOI :
10.1109/IPDPS.2006.1639698