Title :
Methodologies and application of machine learning algorithms to classify the performance of high performance cluster components
Author :
Romero, Pablo ; Idler, Craig
Author_Institution :
High Performance Computing 1, Los Alamos National Laboratory, Los Alamos, NM, USA
Abstract :
High-performance computing clusters are designed to host highly parallelized applications, often with thousands of nodes allocated to a single job. These jobs, especially those requiring a high degree of synchronous communication, can be significantly slowed by even a single poorly performing or sub-standard component. These components are organized into nodes, each typically comprising CPUs, accelerator processors, memory, a communication bus, and so on. Consequently, it is important to identify and eliminate sub-standard nodes before a job is scheduled onto them. In this paper we describe the process used to measure node performance and the methodology used to quantify poorly performing nodes, or to classify suspect nodes into groups, or clusters, that can later be used to identify future performance issues. This process is more involved than simply running a scientific calculation across all the nodes, finding one that is “slow”, and labeling it a bad node. At Los Alamos, this methodology has been used successfully to find problem nodes and has helped characterize the components of other clusters, aiding in the proactive elimination of potential problems.
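Illustration :
The abstract and keywords point to principal component analysis and clustering algorithms as the classification tools, but the paper's actual pipeline is not reproduced here. The following is a minimal sketch of that general idea, assuming hypothetical per-node benchmark metrics (memory bandwidth, compute score, interconnect latency) and using scikit-learn's PCA and k-means; all names, metrics, and synthetic values are assumptions for illustration only.

    # Hypothetical sketch: group nodes by clustering their benchmark metrics.
    # Metrics, values, and cluster count are illustrative assumptions, not
    # taken from the paper.
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Assumed per-node metrics: [memory bandwidth (GB/s), compute score, latency (us)]
    normal_nodes = rng.normal(loc=[150.0, 1.00, 1.5], scale=[5.0, 0.02, 0.1], size=(95, 3))
    slow_nodes   = rng.normal(loc=[110.0, 0.85, 2.5], scale=[5.0, 0.02, 0.1], size=(5, 3))
    metrics = np.vstack([normal_nodes, slow_nodes])

    # Standardize so no single metric dominates, then reduce with PCA.
    scaled = StandardScaler().fit_transform(metrics)
    components = PCA(n_components=2).fit_transform(scaled)

    # Partition nodes into performance clusters; in this toy example the
    # smaller cluster is flagged as "suspect" for further investigation.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(components)
    suspect_label = np.argmin(np.bincount(labels))
    print("Suspect node indices:", np.flatnonzero(labels == suspect_label))

In practice the cluster assignments would be compared against known-good baselines before any node is removed from the scheduling pool, rather than trusting a single clustering run.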
Keywords :
learning (artificial intelligence); parallel processing; performance evaluation; CPU; accelerator processors; high performance computing clusters; parallelized applications; poor performing nodes; scientific calculation; suspect performing nodes; synchronous communication; Bandwidth; Clustering algorithms; Graphics processing units; Principal component analysis; Standards; Testing;
Conference_Title :
Cluster Computing (CLUSTER), 2014 IEEE International Conference on
Conference_Location :
Madrid, Spain
DOI :
10.1109/CLUSTER.2014.6968669