• DocumentCode
    1690847
  • Title
    Ovis-2: A robust distributed architecture for scalable RAS
  • Author
    Brandt, J.M. ; Debusschere, B.J. ; Gentile, A.C. ; Mayo, J.R. ; Pébay, P.P. ; Thompson, D. ; Wong, M.H.
  • Author_Institution
    Sandia Nat. Labs., Livermore, CA
  • fYear
    2008
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    Resource utilization in High Performance Compute clusters can be improved by increased awareness of system state information. Sophisticated run-time characterization of system state in increasingly large clusters requires a scalable, fault-tolerant RAS framework. In this paper we describe the architecture of OVIS-2 and how it meets these requirements. We also describe some of its sophisticated statistical analysis and 3-D visualization capabilities, along with use cases for them. This framework and its associated tools allow the engineer to explore the behaviors and complex interactions of low-level system elements, while giving the system administrator the desired level of detail about ongoing system and component health.
  • Keywords
    resource allocation; system monitoring; workstation clusters; 3D visualization; Ovis-2; high performance compute clusters; resource utilization; robust distributed architecture; run-time characterization; scalable RAS; scalable fault-tolerant RAS framework; statistical analysis; system state information; Computer architecture; Displays; Failure analysis; Fault tolerance; Fault tolerant systems; Monitoring; Resource management; Robustness; Statistical analysis; US Department of Energy; RAS; cluster monitoring; distributed analysis; failure prediction; fault-tolerance; scalable analysis
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on
  • Conference_Location
    Miami, FL
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4244-1693-6
  • Electronic_ISBN
    1530-2075
  • Type
    conf
  • DOI
    10.1109/IPDPS.2008.4536549
  • Filename
    4536549