DocumentCode :
228742
Title :
Fault-Tolerant Dynamic Task Graph Scheduling
Author :
Kurt, Mehmet Can ; Krishnamoorthy, Sriram ; Agrawal, Kunal ; Agrawal, Gagan
fYear :
2014
fDate :
16-21 Nov. 2014
Firstpage :
719
Lastpage :
730
Abstract :
In this paper, we present an approach to fault tolerant execution of dynamic task graphs scheduled using work stealing. In particular, we focus on selective and localized recovery of tasks in the presence of soft faults. From users, we elicit the basic task graph structure in terms of successor and predecessor relationships. The work-stealing-based algorithm to schedule such a task graph is augmented to enable recovery when the data and metadata associated with a task get corrupted. We use this redundancy, and knowledge of the task graph structure, to selectively recover from faults with low space and time overheads. We show that the fault tolerant design retains the essential properties of the underlying work stealing-based task scheduling algorithm, and that the fault tolerant execution is asymptotically optimal when task re-execution is taken into account. Experimental evaluation demonstrates the low cost of recovery under various fault scenarios.
Keywords :
fault tolerant computing; graph theory; meta data; parallel processing; scheduling; task analysis; asymptotically optimal fault tolerant execution; fault tolerant design; fault-tolerant dynamic task graph scheduling; localized task recovery; metadata; space overheads; successor-predecessor relationships; task graph structure; time overheads; work stealing-based task scheduling algorithm; Arrays; Dynamic scheduling; Fault tolerance; Fault tolerant systems; Instruction sets; Radiation detectors; Scheduling algorithms; cilk; dag; fault tolerance; task graphs; work stealing;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for
Conference_Location :
New Orleans, LA
Print_ISBN :
978-1-4799-5499-5
Type :
conf
DOI :
10.1109/SC.2014.64
Filename :
7013046