DocumentCode :
2999562
Title :
Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI
Author :
Rajachandrasekar, Raghunath ; Besseron, Xavier ; Panda, Dhabaleswar K.
Author_Institution :
Network-Based Comput. Lab., Ohio State Univ., Columbus, OH, USA
fYear :
2012
fDate :
21-25 May 2012
Firstpage :
1136
Lastpage :
1143
Abstract :
Fault-detection and prediction in HPC clusters and Cloud-computing systems are increasingly challenging issues. Several system middleware, such as job schedulers and MPI implementations, provide support for both reactive and proactive mechanisms to tolerate faults. These techniques rely on external components such as system logs and infrastructure monitors to provide information about hardware/software failures, either through detection or as a prediction. However, these middleware work in isolation, without disseminating the knowledge of faults encountered. In this context, we propose a light-weight multi-threaded service, namely FTB-IPMI, which provides distributed fault-monitoring using the Intelligent Platform Management Interface (IPMI) and coordinated propagation of fault information using the Fault-Tolerance Backplane (FTB). In essence, it serves as a middleman between system hardware and the software stack by translating raw hardware events into structured software events and delivering them to any interested component using a publish-subscribe framework. Fault-predictors and other decision-making engines that rely on distributed failure information can benefit from FTB-IPMI to facilitate proactive fault-tolerance mechanisms such as preemptive job migration. We have developed a fault-prediction engine within MVAPICH2, an RDMA-based MPI implementation, to demonstrate this capability. Failure predictions made by this engine are used to trigger migration of processes from failing nodes to healthy spare nodes, thereby providing resilience to the MPI application. Experimental evaluation clearly indicates that a single instance of FTB-IPMI can scale to several hundred nodes with a remarkably low resource-utilization footprint. A deployment of FTB-IPMI that services a cluster with 128 compute nodes sweeps the entire cluster and collects IPMI sensor information on CPU temperature, system voltages, and fan speeds in about 0.75 seconds. The average CPU utilization of this service running on a single node is 0.35%.
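The abstract outlines FTB-IPMI's pipeline: sweep the cluster's IPMI sensors, translate the raw readings into structured software events, and publish them to subscribers such as the MVAPICH2 fault-prediction engine. The sketch below illustrates that flow in Python under stated assumptions: it shells out to the standard ipmitool client for sensor readings, while the node names, BMC credentials, temperature threshold, and the subscribe/publish helpers are hypothetical stand-ins for the FTB publish-subscribe layer, not the paper's actual multi-threaded C implementation.

#!/usr/bin/env python3
"""Minimal sketch of the FTB-IPMI idea: poll IPMI sensors on a set of
nodes and publish structured fault events to interested subscribers."""

import subprocess
from typing import Callable, Dict, List

# Hypothetical threshold: flag CPU temperatures above this value (deg C).
CPU_TEMP_LIMIT = 80.0

# Stand-in for the FTB publish-subscribe layer.
_subscribers: List[Callable[[Dict[str, str]], None]] = []

def subscribe(handler: Callable[[Dict[str, str]], None]) -> None:
    """Register a consumer (e.g. a fault predictor) for structured events."""
    _subscribers.append(handler)

def publish(event: Dict[str, str]) -> None:
    """Deliver a structured event to every subscriber."""
    for handler in _subscribers:
        handler(event)

def read_sensors(host: str) -> List[Dict[str, str]]:
    """Query a node's BMC with ipmitool and return parsed sensor rows.
    Raises CalledProcessError if the BMC is unreachable."""
    out = subprocess.run(
        ["ipmitool", "-H", host, "-U", "admin", "-P", "admin", "sensor"],
        capture_output=True, text=True, check=True).stdout
    rows = []
    for line in out.splitlines():
        # ipmitool sensor prints '|'-separated columns: name, value, unit, status, ...
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 4:
            rows.append({"node": host, "sensor": fields[0],
                         "value": fields[1], "unit": fields[2],
                         "status": fields[3]})
    return rows

def sweep(nodes: List[str]) -> None:
    """One sweep over the cluster: translate raw readings into events."""
    for host in nodes:
        for row in read_sensors(host):
            if row["unit"] == "degrees C":
                try:
                    temp = float(row["value"])
                except ValueError:
                    continue
                if temp > CPU_TEMP_LIMIT:
                    publish({"event": "NODE_OVERHEATING", **row})

if __name__ == "__main__":
    # A fault predictor would subscribe here; node names are hypothetical.
    subscribe(lambda ev: print("predictor received:", ev))
    sweep(["node001", "node002"])

In the real service, the sweep runs periodically in worker threads and the events are delivered through FTB's publish-subscribe API to any registered component, such as the fault-prediction engine in MVAPICH2 that triggers process migration.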
Keywords :
computerised monitoring; decision making; fault diagnosis; fault tolerant computing; middleware; multi-threading; online front-ends; pattern clustering; resource allocation; user interfaces; CPU temperature; FTB-IPMI; HPC clusters; IPMI sensor information; Intelligent Platform Management Interface; MVAPICH2; RDMA-based MPI implementation; cloud computing systems; decision making engines; distributed failure information; distributed fault monitoring; external components; failing nodes; fan speeds; fault prediction engine; fault tolerance backplane; hardware failure monitoring; hardware failure prediction; hardware stack; hardware-software failure; healthy spare nodes; high-performance computing clusters; infrastructure monitors; job schedulers; light-weight multithreaded service; preemptive job migration; proactive fault tolerance mechanisms; proactive mechanisms; publish-subscribe framework; raw hardware events; reactive mechanisms; resource utilization; software stack; structured software events; system logs; system middleware; system voltages; Fault tolerance; Fault tolerant systems; Hardware; Libraries; Monitoring; Software; Temperature sensors; FTB; HPC Clusters; Fault detection; IPMI; coordinated fault propagation
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International
Conference_Location :
Shanghai
Print_ISBN :
978-1-4673-0974-5
Type :
conf
DOI :
10.1109/IPDPSW.2012.139
Filename :
6270765