DocumentCode :
262127
Title :
Fault-Tolerant Global Load Balancing in X10
Author :
Bungart, Marco ; Fohry, Claudia ; Posner, Jonas
Author_Institution :
Res. Group Program. Languages, Methodologies Univ. of Kassel, Kassel, Germany
fYear :
2014
fDate :
22-25 Sept. 2014
Firstpage :
471
Lastpage :
478
Abstract :
Scalability postulates fault tolerance to be effective. We consider a user-level fault tolerance technique to cope with permanent node failures. It is supported by X10, one of the major Partitioned Global Address Space (PGAS) languages. In Resilient X10, an exception is thrown when a place (node) fails. This paper investigates task pools, which are often used by irregular applications to balance their load. We consider global load balancing with one worker per place. Each worker maintains a private task pool and supports cooperative work stealing. Tasks may generate new tasks dynamically, are free of side-effects, and their results are combined by reduction. Our first contribution is a task pool algorithm that can handle permanent place failures. It is based on snapshots that are regularly written to other workers and are updated in the event of stealing. Second, we implemented the algorithm in the Global Load Balancing framework GLB, which is part of the standard library of X10. We ran experiments with the Unbalanced Tree Search (UTS) and Betweenness Centrality (BC) benchmarks. With 64 places on 4 nodes, for instance, we observed an overhead of about 4% for using fault-tolerant GLB instead of GLB. The protocol overhead for a place failure was neglectable.
Keywords :
fault tolerant computing; parallel programming; resource allocation; software libraries; BC benchmark; PGAS languages; Resilient X10; UTS benchmark; betweenness centrality benchmark; cooperative work stealing; dynamic task generation; fault-tolerant GLB; fault-tolerant global load balancing; irregular applications; partitioned global address space; permanent node failures; place failure; private task pool algorithm; protocol overhead; standard library; unbalanced tree search benchmark; user-level fault tolerance technique; Data structures; Electronics packaging; Fault tolerance; Fault tolerant systems; Load management; Protocols; Registers; GLB; Resilient X10; algorithmic resilience; task pool;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2014 16th International Symposium on
Conference_Location :
Timisoara
Print_ISBN :
978-1-4799-8447-3
Type :
conf
DOI :
10.1109/SYNASC.2014.69
Filename :
7034719