Title :
A fault tolerant self-scheduling scheme for parallel loops on shared memory systems
Author :
Yizhuo Wang ; Nicolau, A. ; Cammarota, Rosario ; Veidenbaum, A.V.
Author_Institution :
Sch. of Comput. Sci. & Technol., Beijing Inst. of Technol., Beijing, China
Abstract :
As the number of cores per chip increases, many applications can achieve significant speedup by exploiting loop-level parallelism (LLP). Meanwhile, continued scaling of device sizes makes multicore/multiprocessor systems increasingly prone to reliability problems. The scheduling scheme plays a key role in exploiting LLP. Among existing dynamic loop scheduling schemes, self-scheduling is the most commonly used. This paper presents FTSS, a fault-tolerant self-scheduling scheme that aims to execute parallel loops efficiently in the presence of hardware faults on shared memory systems. Our technique transforms a loop to ensure that loop iterations can be re-executed correctly by buffering variables with anti-dependences, which makes it possible to design a fault-tolerant loop scheduling scheme without checkpointing. FTSS combines work-stealing with self-scheduling and uses a bidirectional execution model when work is stolen from a faulty core. Experimental results show that FTSS achieves better load balancing than existing self-scheduling schemes. Compared with checkpoint/restart implementations that save a checkpoint before executing each chunk of iterations and restart the whole chunk that was running on a faulty core, FTSS exhibits better runtime performance. In addition, FTSS greatly outperforms existing self-scheduling schemes in terms of performance and stability in heavily loaded runtime environments.
Keywords :
fault tolerant computing; iterative methods; parallel processing; performance evaluation; processor scheduling; shared memory systems; FTSS; LLP; bidirectional execution model; fault tolerant loop scheduling scheme; fault tolerant self-scheduling scheme; heavy loaded runtime environment; load balancing; loop iteration re-execution; loop level parallelism; multicore systems; multiprocessor systems; parallel loops; runtime performance; shared memory systems; work-stealing; fault tolerance; loop scheduling; multicore processors; self-scheduling;
Conference_Title :
High Performance Computing (HiPC), 2012 19th International Conference on
Conference_Location :
Pune
Print_ISBN :
978-1-4673-2372-7
Electronic_ISBN :
978-1-4673-2370-3
DOI :
10.1109/HiPC.2012.6507476
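Illustrative note :
The abstract states that buffering variables with anti-dependences makes re-execution of loop iterations safe without checkpointing. The sketch below is one possible reading of that idea, not the FTSS transformation itself, whose details the abstract does not give. The loop body (`x * 2 + 1`), the function names, and the `faulty_chunks` parameter are all hypothetical and chosen only for illustration; the fault is simulated by simply running a chunk twice.

```python
# Hypothetical sketch: why buffering read-then-overwritten values makes
# chunk re-execution safe. This is NOT the FTSS algorithm from the paper.

def run_chunk_in_place(a, start, end):
    """Original loop body: each iteration reads a[i] and then overwrites it."""
    for i in range(start, end):
        a[i] = a[i] * 2 + 1


def run_chunk_buffered(src, dst, start, end):
    """Transformed loop body: reads come from a buffer, writes go to dst."""
    for i in range(start, end):
        dst[i] = src[i] * 2 + 1


def execute(data, chunk, faulty_chunks, buffered):
    """Run the loop chunk by chunk; chunks whose start index is in
    faulty_chunks are executed twice to simulate re-execution after a fault."""
    out = list(data)
    src = list(data)  # buffer holding the original values that the loop reads
    for start in range(0, len(data), chunk):
        end = min(start + chunk, len(data))
        runs = 2 if start in faulty_chunks else 1
        for _ in range(runs):
            if buffered:
                run_chunk_buffered(src, out, start, end)
            else:
                run_chunk_in_place(out, start, end)
    return out


fault_free = execute([1, 2, 3, 4], chunk=2, faulty_chunks=set(), buffered=False)
naive = execute([1, 2, 3, 4], chunk=2, faulty_chunks={0}, buffered=False)
safe = execute([1, 2, 3, 4], chunk=2, faulty_chunks={0}, buffered=True)
```

In this sketch, re-running the first chunk in place applies the update twice, while the buffered version yields the same result as a fault-free run because every re-executed iteration reads the preserved original values.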