Title :
A work-stealing scheduling framework supporting fault tolerance
Author :
Wang, Yizhuo ; Ji, Weixing ; Shi, Feng ; Zuo, Qi
Author_Institution :
School of Computer Science and Technology, Beijing Institute of Technology, China
Abstract :
Fault tolerance and load balancing are critical to executing long-running parallel applications on multicore clusters. This paper addresses both by presenting a novel work-stealing task scheduling framework that supports hardware fault tolerance. In this framework, both transient and permanent faults are detected and recovered from at task granularity. We incorporate task-based fault detection and recovery mechanisms into a hierarchical work-stealing scheme to establish the framework. The framework provides low-overhead fault tolerance and optimal load balancing by fully exploiting task parallelism.
Keywords :
Checkpointing; Computer crashes; Fault tolerance; Fault tolerant systems; Multicore processing; Parallel processing; Transient analysis; cluster; multicore; work-stealing
Conference_Title :
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013
Conference_Location :
Grenoble, France
Print_ISBN :
978-1-4673-5071-6
DOI :
10.7873/DATE.2013.150