Regional Information Center for Science and Technology (RICEST) - Reliability-aware resource allocation in HPC systems

DocumentCode :

2888314

Title :

Reliability-aware resource allocation in HPC systems

Author :

Gottumukkala, Narasimha Raju ; Leangsuksun, Chokchai Box ; Taerat, Narate ; Nassar, Raja ; Scott, Stephen L.

Author_Institution :

eXtreme Comput. Res. Group, Louisiana Tech Univ., Ruston, LA

fYear :

2007

fDate :

17-20 Sept. 2007

Firstpage :

312

Lastpage :

321

Abstract :

Failures and downtimes have a severe impact on the performance of parallel programs in a large-scale High Performance Computing (HPC) environment. There have been several research efforts to understand the failure behavior of computing systems. However, the multitude of hardware and software components required for uninterrupted operation of parallel programs makes failure and reliability prediction a challenging problem. HPC run-time systems like checkpoint frameworks and resource managers rely on the reliability knowledge of resources to minimize the performance loss due to unexpected failures. In this paper, we first analyze the Time Between Failure (TBF) distribution of individual nodes from a 512-node HPC system. Time-varying distributions like Weibull, lognormal and gamma are observed to have better goodness-of-fit than the exponential distribution. We then present a reliability-aware resource allocation model for parallel programs based on one of the time-varying distributions and present reliability-aware resource allocation algorithms to minimize the performance loss due to failures. We show the effectiveness of reliability-aware resource allocation algorithms based on the actual failure logs of the 512-node system and parallel workloads obtained from LANL and SDSC. The simulation results indicate that applying reliability-aware resource allocation techniques reduces the overall waste time of parallel jobs by as much as 30%. A further improvement of 15% in waste time is possible by considering the job run lengths in reliability-aware scheduling.
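The abstract does not spell out the allocation algorithms, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each node's TBF follows a Weibull distribution with its own shape and scale, that node failures are independent, and that a parallel job fails if any of its nodes fails. The function names, the node records, and all parameter values are hypothetical.

```python
import math


def weibull_reliability(t, shape, scale):
    """Probability that a node survives past time t under a Weibull TBF model."""
    return math.exp(-((t / scale) ** shape))


def conditional_reliability(age, run_length, shape, scale):
    """Probability of surviving run_length more hours, given the node has
    already been up for `age` hours since its last failure."""
    return (weibull_reliability(age + run_length, shape, scale)
            / weibull_reliability(age, shape, scale))


def pick_nodes(nodes, k, run_length):
    """Pick the k nodes most likely to survive the job (hypothetical policy).

    Returns the chosen node names and the job-level reliability, computed as
    the product of node reliabilities under the independence assumption.
    """
    def score(node):
        return conditional_reliability(
            node["age"], run_length, node["shape"], node["scale"])

    chosen = sorted(nodes, key=score, reverse=True)[:k]
    job_reliability = math.prod(score(node) for node in chosen)
    return [node["name"] for node in chosen], job_reliability


# Hypothetical per-node Weibull parameters (hours); not taken from the paper.
nodes = [
    {"name": "n1", "shape": 0.7, "scale": 500.0, "age": 10.0},
    {"name": "n2", "shape": 0.7, "scale": 500.0, "age": 400.0},
    {"name": "n3", "shape": 1.0, "scale": 200.0, "age": 50.0},
    {"name": "n4", "shape": 0.7, "scale": 800.0, "age": 5.0},
]

selected, job_reliability = pick_nodes(nodes, k=2, run_length=24.0)
print(selected, round(job_reliability, 3))
```

With a shape parameter of 1 the Weibull model reduces to the exponential distribution, whose conditional reliability does not depend on node age. With a shape below 1 the failure rate falls over time, so nodes that have been up longer since their last failure score higher. Taking the job run length as an input mirrors the abstract's point that run lengths can inform the scheduling decision.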

Keywords :

parallel algorithms; parallel programming; resource allocation; software performance evaluation; software reliability; statistical distributions; system recovery; 512 node system; HPC run-time system; actual failure log; hardware-software component; high performance computing; parallel program performance; reliability-aware resource allocation algorithm; time between failure distribution; time varying distribution; Computer networks; Concurrent computing; Educational institutions; High performance computing; Large-scale systems; Performance loss; Power system reliability; Quality of service; Resource management; Switches;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Cluster Computing, 2007 IEEE International Conference on

Conference_Location :

Austin, TX

ISSN :

1552-5244

Print_ISBN :

978-1-4244-1387-4

Electronic_ISBN :

1552-5244

Type :

conf

DOI :

10.1109/CLUSTR.2007.4629245

Filename :

4629245

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2888314