DocumentCode :
3678442
Title :
Evolution of Monitoring over the Lifetime of a High Performance Computing Cluster
Author :
Adam DeConinck;Kathleen Kelly
Author_Institution :
Los Alamos Nat. Lab., Los Alamos, NM, USA
fYear :
2015
Firstpage :
710
Lastpage :
713
Abstract :
High Performance Computing (HPC) systems typically have lifetimes of four to six years. During this lifetime, a system will undergo substantial changes in its software stack and hardware configuration. Simultaneously, the physical environment around it will change as old systems are retired and new systems are brought in. This report focuses on our experience with Mustang, a 1600-node Linux cluster at LANL. Over the three years we have operated Mustang, the machine and its environment have changed substantially, which has resulted in reliability and stability issues on the cluster. In this report we present our experiences with the standard monitoring and analysis tools available on Mustang since its installation, and describe how recent advances in our tools and usage have improved our ability to troubleshoot these issues and perform timely root cause analysis. These advances have both improved our management of existing installations and informed our hardware and tooling requirements for future systems.
Keywords :
"Monitoring","Hardware","Temperature sensors","Standards","Temperature measurement","Linux","Testing"
Publisher :
IEEE
Conference_Title :
Cluster Computing (CLUSTER), 2015 IEEE International Conference on
Type :
conf
DOI :
10.1109/CLUSTER.2015.123
Filename :
7307672