DocumentCode
1690847
Title
Ovis-2: A robust distributed architecture for scalable RAS
Author
Brandt, J.M. ; Debusschere, B.J. ; Gentile, A.C. ; Mayo, J.R. ; Pébay, P.P. ; Thompson, D. ; Wong, M.H.
Author_Institution
Sandia Nat. Labs., Livermore, CA
fYear
2008
Firstpage
1
Lastpage
8
Abstract
Resource utilization in High Performance Compute clusters can be improved by increased awareness of system state information. Sophisticated run-time characterization of system state in increasingly large clusters requires a scalable, fault-tolerant RAS framework. In this paper we describe the architecture of OVIS-2 and how it meets these requirements. We also describe some of its statistical analysis and 3-D visualization capabilities, along with use cases for them. This framework and its associated tools allow the engineer to explore the behaviors and complex interactions of low-level system elements, while giving the system administrator the desired level of detail about ongoing system and component health.
Keywords
resource allocation; system monitoring; workstation clusters; 3D visualization; Ovis-2; high performance compute clusters; resource utilization; robust distributed architecture; run-time characterization; scalable RAS; scalable fault-tolerant RAS framework; statistical analysis; system state information; Computer architecture; Displays; Failure analysis; Fault tolerance; Fault tolerant systems; Monitoring; Resource management; Robustness; Statistical analysis; US Department of Energy; RAS; cluster monitoring; distributed analysis; failure prediction; fault-tolerance; scalable analysis;
fLanguage
English
Publisher
ieee
Conference_Titel
Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on
Conference_Location
Miami, FL
ISSN
1530-2075
Print_ISBN
978-1-4244-1693-6
Electronic_ISBN
1530-2075
Type
conf
DOI
10.1109/IPDPS.2008.4536549
Filename
4536549