DocumentCode
2052451
Title
OVIS: a tool for intelligent, real-time monitoring of computational clusters
Author
Brandt, J.M. ; Gentile, A.C. ; Hale, D.J. ; Pébay, P.P.
Author_Institution
Sandia Nat. Labs., Livermore, CA
fYear
2006
fDate
25-29 April 2006
Abstract
Traditional cluster monitoring approaches consider nodes in singleton, using manufacturer-specified extreme limits as thresholds for failure "prediction". We have developed a tool, OVIS, for monitoring and analysis of large computational platforms which, instead, uses a statistical approach to characterize single device behaviors from those of a large number of statistically similar devices. Baseline capabilities of OVIS include the visual display of deterministic information about state variables (e.g., temperature, CPU utilization, fan speed) and their aggregate statistics. Visual consideration of the cluster as a comparative ensemble, rather than as singleton nodes, is an easy and useful method for tuning cluster configuration and determining effects of real-time changes. Additionally, OVIS incorporates a novel Bayesian inference scheme to dynamically infer models for the normal behavior of a system and to determine bounds on the probability of values evinced in the system. Individual node values that are unlikely given the current applicable model are flagged as aberrant. This can be a much earlier indicator of problems than waiting for the crossing of some threshold that is necessarily set high to preclude too many false alarms. We present OVIS and discuss its applications in cluster configuration and environmental tuning and to abnormality and problem discovery in our production clusters.
Keywords
belief networks; inference mechanisms; system monitoring; workstation clusters; Bayesian inference; OVIS; abnormality detection; cluster monitoring software; computational clusters; large computational platforms; single device behaviors; software tool; Aggregates; Bayesian methods; Computational intelligence; Computer aided manufacturing; Condition monitoring; Displays; Laboratories; Statistical analysis; Temperature distribution; US Department of Energy; RAS; cluster monitoring
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International
Conference_Location
Rhodes Island
Print_ISBN
1-4244-0054-6
Type
conf
DOI
10.1109/IPDPS.2006.1639698
Filename
1639698