Regional Information Center for Science and Technology - Methodologies and application of machine learning algorithms to classify the performance of high performance cluster components

DocumentCode :

166598

Title :

Methodologies and application of machine learning algorithms to classify the performance of high performance cluster components

Author :

Romero, Pablo ; Idler, Craig

Author_Institution :

High Performance Comput. 1, Los Alamos Nat. Lab., Los Alamos, NM, USA

fYear :

2014

fDate :

22-26 Sept. 2014

Firstpage :

400

Lastpage :

407

Abstract :

High Performance Computing Clusters are designed to host highly parallelized applications, often with thousands of nodes allocated to a single job. These jobs, especially those that require a high level of synchronous communication, can be greatly affected by a single poorly performing, or even substandard, component. These components, often referred to as nodes, typically comprise CPUs, accelerator processors, memory, a communication bus, and so on. Consequently, it is important to identify and eliminate these substandard nodes before a job is scheduled onto them. In this paper we describe the process used to measure node performance and the methodology used to quantify poorly performing nodes or classify suspect nodes into groups, or clusters, that can later be used to identify future performance issues. This process is more involved than simply running a scientific calculation across all the nodes, finding one that was “slow”, and labeling it as a bad node. At Los Alamos, this methodology has been used successfully to find problem nodes and has helped characterize the components of other clusters to aid in the proactive elimination of potential problems.
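The abstract's idea of grouping suspect nodes, rather than labeling a single "slow" one, can be illustrated with a small sketch. This is not the authors' method: the record does not describe their algorithm, so the sketch substitutes a simple per-metric robust z-score (median and MAD) for whatever clustering or principal component analysis the paper uses. The node names, metric names, scores, and the 3.5 threshold are all hypothetical.

```python
from statistics import median


def robust_z(values):
    """Return modified z-scores based on the median and the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values (or most of them) are identical: nothing stands out.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def group_suspect_nodes(results, threshold=3.5):
    """Group nodes by the set of metrics on which they fall far below their peers.

    results: {node_name: {metric_name: score}}, where a higher score is better.
    Returns {tuple_of_bad_metrics: [node_name, ...]}.
    """
    nodes = sorted(results)
    metrics = sorted(next(iter(results.values())))
    flagged = {node: [] for node in nodes}
    for metric in metrics:
        scores = robust_z([results[n][metric] for n in nodes])
        for node, z in zip(nodes, scores):
            if z < -threshold:
                flagged[node].append(metric)
    groups = {}
    for node, bad in flagged.items():
        if bad:
            groups.setdefault(tuple(bad), []).append(node)
    return groups


# Hypothetical benchmark results (assumed data, not from the paper).
results = {
    "node01": {"mem_bw": 100.0, "flops": 50.0},
    "node02": {"mem_bw": 101.0, "flops": 51.0},
    "node03": {"mem_bw": 99.0, "flops": 49.0},
    "node04": {"mem_bw": 102.0, "flops": 50.5},
    "node05": {"mem_bw": 60.0, "flops": 50.2},
    "node06": {"mem_bw": 100.5, "flops": 30.0},
}
groups = group_suspect_nodes(results)
print(groups)
```

With this assumed data, node05 lands in a memory-bandwidth group and node06 in a compute group, which shows why grouping by failure signature can be more informative than a single pass/fail label.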

Keywords :

learning (artificial intelligence); parallel processing; performance evaluation; CPU; accelerator processors; high performance computing clusters; parallelized applications; poor performing nodes; scientific calculation; suspect performing nodes; synchronous communication; Bandwidth; Clustering algorithms; Graphics processing units; Principal component analysis; Standards; Testing;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Cluster Computing (CLUSTER), 2014 IEEE International Conference on

Conference_Location :

Madrid

Type :

conf

DOI :

10.1109/CLUSTER.2014.6968669

Filename :

6968669

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=166598