• DocumentCode
    2684465
  • Title

    Trouble Dashboard: A Distributed Failure Monitoring System for High-End Computing

  • Author

    Do, Thanh ; Nguyen, Thuy ; Nguyen, Dung T. ; Nguyen, Hiep C. ; Shi, Weisong

  • Author_Institution
    Dept. of Inf. Syst., Hanoi Univ. of Technol., Hanoi, Vietnam
  • fYear
    2009
  • fDate
    13-17 July 2009
  • Firstpage
    1
  • Lastpage
    7
  • Abstract
    Failure management is crucial for high-performance computing systems, especially as the complexity of applications and the underlying infrastructure has grown sharply in recent years. In this paper, we present the design, implementation, and experimental evaluation of Trouble Dashboard (TD), an adaptive, flexible, and low-overhead failure monitoring system. Our goal is to provide a lightweight, scalable failure-monitoring tool for both application scientists and system managers. TD provides a set of APIs that let application scientists flexibly control how their applications behave when failures happen. System managers can use the tool to monitor not only the status of computing nodes and running tasks but also failures as they occur. Experiments show that TD incurs low overhead while remaining accurate and flexible enough to adapt to various applications.
  • Keywords
    application program interfaces; computational complexity; performance evaluation; system monitoring; system recovery; APIs; distributed failure monitoring system; failure management; high performance computing systems; high-end computing; scalable failure-monitoring tool; trouble dashboard; Application software; Computer science; Computerized monitoring; Condition monitoring; Distributed computing; Hardware; High performance computing; Management information systems; Real time systems; Technology management
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computing and Communication Technologies, 2009. RIVF '09. International Conference on
  • Conference_Location
    Da Nang
  • Print_ISBN
    978-1-4244-4566-0
  • Electronic_ISBN
    978-1-4244-4568-4
  • Type

    conf

  • DOI
    10.1109/RIVF.2009.5174661
  • Filename
    5174661