DocumentCode :
2052346
Title :
Easy and reliable cluster management: the self-management experience of Fire Phoenix
Author :
Zhi-Hong, Zhang ; Dan, Meng ; Jian-Feng, Zhan ; Lei, Wang ; Lin-ping, Wu ; Wei, Huang
Author_Institution :
Inst. of Comput. Technol., Chinese Acad. of Sci., Beijing
fYear :
2006
fDate :
25-29 April 2006
Abstract :
High-performance clusters are rapidly becoming an important computing platform for both scientific and business applications. To meet the new demands and challenges, cluster system software is inevitably complex. Even for experienced administrators, managing a cluster system is an exhausting job. This paper introduces Fire Phoenix, scalable and self-managing cluster system software that supports both scientific and commercial applications. With its self-configuring and self-healing features, much of the machine configuration and error recovery can be done automatically. Our design has proven effective in the operation of the Dawning 4000A supercomputer, which is the largest cluster system in China.
Keywords :
computer network management; computer network reliability; workstation clusters; Dawning 4000A supercomputer; Fire Phoenix; cluster system software; computer cluster management; error recovery; machine configuration; Application software; Availability; Business; Computers; Costs; Fires; Scalability; Scientific computing; Supercomputers; System software;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International
Conference_Location :
Rhodes Island
Print_ISBN :
1-4244-0054-6
Type :
conf
DOI :
10.1109/IPDPS.2006.1639694
Filename :
1639694