DocumentCode :
3042730
Title :
Fault-aware job scheduling for BlueGene/L systems
Author :
Oliner, A.J. ; Sahoo, R.K. ; Moreira, J.E. ; Gupta, M. ; Sivasubramaniam, A.
Author_Institution :
Dept. of Electr. Eng. & Comput. Sci., Massachusetts Inst. of Technol., Cambridge, MA, USA
fYear :
2004
fDate :
26-30 April 2004
Firstpage :
64
Abstract :
Summary form only given. Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. We evaluate the effectiveness of a previously developed job scheduling algorithm for BlueGene/L in the presence of faults. We have developed two new job-scheduling algorithms considering failures while scheduling the jobs. We have also evaluated the impact of these algorithms on average bounded slowdown, average response time and system utilization, considering different levels of proactive failure prediction and prevention techniques reported in the literature. Our simulation studies show that the use of these new algorithms with even trivial fault prediction confidence or accuracy levels (as low as 10%) can significantly improve the performance of the BlueGene/L system.
Keywords :
parallel machines; performance evaluation; processor scheduling; system recovery; BlueGene/L systems; average response time; fault-aware job scheduling algorithm; proactive failure prediction; system utilization; Computer science; Concurrent computing; Delay; Laboratories; Large-scale systems; Parallel machines; Processor scheduling; Scheduling algorithm; Switches; Time sharing computer systems;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International
Print_ISBN :
0-7695-2132-0
Type :
conf
DOI :
10.1109/IPDPS.2004.1302991
Filename :
1302991