Title :
Design and Implementation of an Integrated Fault-Supervising System for Large HPCs
Author :
Qi, Chunsheng ; Zheng, Xiao ; Kuang, Biying ; Zhou, Wei
Author_Institution :
Jiangnan Inst. of Comput. Technol., Wuxi
Abstract :
Faults and failures are the biggest obstacles preventing high-performance computing systems (HPCs) from delivering their full functionality and performance. To minimize or eliminate their impact, HPCs must be supervised so that relevant information is obtained in time and effective action can be taken as soon as possible. To explore this approach, this paper presents an integrated fault-supervising system (IFS) designed for a large HPC system, which uses large numbers of distributed sensors and intelligent control units to acquire fault information rapidly. Furthermore, it can automatically perform emergency processing in certain circumstances based on the acquired information, and it supports both local and remote management through convenient visual interfaces. To date, the supervising system has operated well for a few years and has helped the target system reach over 90% availability, which indicates to some degree that the design is successful, that the supervising system deserves further research, and that it could be applied more widely.
Keywords :
distributed sensors; fault tolerant computing; parallel processing; system monitoring; system recovery; automatic emergency processing; distributed sensor; high-performance computing system failure; integrated fault-supervising system design; intelligent control unit; remote management; visual interface; Availability; Computers; Cooling; High performance computing; Large-scale systems; Maintenance; Power supplies; Power system reliability; Remote monitoring; Temperature
Conference_Title :
Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for
Conference_Location :
Hunan
Print_ISBN :
978-0-7695-3398-8
Electronic_ISBN :
978-0-7695-3398-8
DOI :
10.1109/ICYCS.2008.121