DocumentCode
166598
Title
Methodologies and application of machine learning algorithms to classify the performance of high performance cluster components
Author
Romero, Pablo ; Idler, Craig
Author_Institution
High Performance Comput. 1, Los Alamos Nat. Lab., Los Alamos, NM, USA
fYear
2014
fDate
22-26 Sept. 2014
Firstpage
400
Lastpage
407
Abstract
High Performance Computing clusters are designed to host highly parallelized applications, often with thousands of nodes allocated to a single job. These jobs, especially those that require a high level of synchronous communication, can be greatly affected by a single poorly performing, or even merely sub-standard, component. These components, often referred to as nodes, are typically composed of CPUs, accelerator processors, memory, a communication bus, and so on. Consequently, it is important to identify and eliminate sub-standard nodes before a job is scheduled onto them. In this paper we describe the process used to measure node performance and the methodology used to quantify poorly performing nodes or to classify suspect nodes into groups, or clusters, that can later be used to identify future performance issues. This process is more involved than simply running a scientific calculation across all the nodes, finding one that was “slow”, and labeling it as a bad node. At Los Alamos, this methodology has been used successfully to find problem nodes and has helped characterize the components of other clusters to aid in the proactive elimination of potential problems.
Keywords
learning (artificial intelligence); parallel processing; performance evaluation; CPU; accelerator processors; high performance computing clusters; parallelized applications; poor performing nodes; scientific calculation; suspect performing nodes; synchronous communication; Bandwidth; Clustering algorithms; Graphics processing units; Principal component analysis; Standards; Testing;
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster Computing (CLUSTER), 2014 IEEE International Conference on
Conference_Location
Madrid
Type
conf
DOI
10.1109/CLUSTER.2014.6968669
Filename
6968669