DocumentCode :
2052451
Title :
OVIS: a tool for intelligent, real-time monitoring of computational clusters
Author :
Brandt, J.M. ; Gentile, A.C. ; Hale, D.J. ; Pébay, P.P.
Author_Institution :
Sandia Nat. Labs., Livermore, CA
fYear :
2006
fDate :
25-29 April 2006
Abstract :
Traditional cluster monitoring approaches consider nodes in singleton, using manufacturer-specified extreme limits as thresholds for failure "prediction". We have developed a tool, OVIS, for monitoring and analysis of large computational platforms which, instead, uses a statistical approach to characterize single device behaviors from those of a large number of statistically similar devices. Baseline capabilities of OVIS include the visual display of deterministic information about state variables (e.g., temperature, CPU utilization, fan speed) and their aggregate statistics. Visual consideration of the cluster as a comparative ensemble, rather than as singleton nodes, is an easy and useful method for tuning cluster configuration and determining effects of real-time changes. Additionally, OVIS incorporates a novel Bayesian inference scheme to dynamically infer models for the normal behavior of a system and to determine bounds on the probability of values evinced in the system. Individual node values that are unlikely given the current applicable model are flagged as aberrant. This can be a much earlier indicator of problems than waiting for the crossing of some threshold that is necessarily set high to preclude too many false alarms. We present OVIS and discuss its applications in cluster configuration and environmental tuning and to abnormality and problem discovery in our production clusters.
Keywords :
belief networks; inference mechanisms; system monitoring; workstation clusters; Bayesian inference; OVIS; abnormality detection; cluster monitoring software; computational clusters; large computational platforms; single device behaviors; software tool; Aggregates; Bayesian methods; Computational intelligence; Computer aided manufacturing; Condition monitoring; Displays; Laboratories; Statistical analysis; Temperature distribution; US Department of Energy; RAS; cluster monitoring
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International
Conference_Location :
Rhodes Island
Print_ISBN :
1-4244-0054-6
Type :
conf
DOI :
10.1109/IPDPS.2006.1639698
Filename :
1639698