Regional Information Center for Science and Technology - Fault-Aware Runtime Strategies for High-Performance Computing

DocumentCode :

811036

Title :

Fault-Aware Runtime Strategies for High-Performance Computing

Author :

Li, Yawei ; Lan, Zhiling ; Gujrati, Prashasta ; Sun, Xian-He

Author_Institution :

Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL

Volume :

Issue :

fYear :

2009

fDate :

4/1/2009 12:00:00 AM

Firstpage :

460

Lastpage :

473

Abstract :

As the scale of parallel systems continues to grow, fault management of these systems is becoming a critical challenge. While existing research mainly focuses on developing or improving fault tolerance techniques, a number of key issues remain open. In this paper, we propose runtime strategies for spare node allocation and job rescheduling in response to failure prediction. These strategies, together with failure predictor and fault tolerance techniques, construct a runtime system called FARS (Fault-Aware Runtime System). In particular, we propose a 0-1 knapsack model and demonstrate its flexibility and effectiveness for reallocating running jobs to avoid failures. Experiments, by means of synthetic data and real traces from production systems, show that FARS has the potential to significantly improve system productivity (i.e., performance and reliability).
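
The abstract names a 0-1 knapsack model for reallocating running jobs but this record does not give its formulation. As a rough illustration only, the sketch below shows the classic 0-1 knapsack dynamic program applied to a hypothetical setting: each job has a node demand and a value, and a limited number of healthy nodes is available. The function name `select_jobs`, the meaning of "value", and the sample numbers are all assumptions made for this example, not the authors' FARS implementation.

```python
from typing import List, Tuple


def select_jobs(demands: List[int], values: List[float],
                capacity: int) -> Tuple[float, List[int]]:
    """Generic 0-1 knapsack via dynamic programming (illustrative only).

    demands[i]: nodes job i needs (hypothetical weight)
    values[i]:  benefit of keeping job i running (hypothetical value)
    capacity:   healthy nodes available for reallocation (hypothetical)
    Returns the best total value and the indices of the chosen jobs.
    """
    n = len(demands)
    # best[i][c] = maximum value using only the first i jobs and c nodes
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = demands[i - 1], values[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]          # skip job i-1
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v  # take job i-1

    # Walk the table backwards to recover which jobs were taken.
    chosen: List[int] = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= demands[i - 1]
    chosen.reverse()
    return best[n][capacity], chosen


if __name__ == "__main__":
    # Three made-up jobs competing for 5 healthy nodes.
    total, picked = select_jobs([4, 3, 2], [5.0, 4.0, 3.0], 5)
    print(total, picked)
```

With these made-up inputs, the two smaller jobs together are worth more than the single large one, so the sketch selects them. Each job is either reallocated whole or not at all, which is what makes the problem 0-1 rather than fractional.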

Keywords :

failure analysis; knapsack problems; parallel processing; scheduling; software fault tolerance; 0-1 knapsack model; failure prediction; fault management; fault tolerance techniques; fault-aware runtime strategies; high-performance computing; job rescheduling; parallel systems; spare node allocation; Fault-tolerance; Parallel systems; Performance; Scheduling;

fLanguage :

English

Journal_Title :

Parallel and Distributed Systems, IEEE Transactions on

Publisher :

IEEE

ISSN :

1045-9219

Type :

jour

DOI :

10.1109/TPDS.2008.128

Filename :

4569836

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=811036