DocumentCode
262127
Title
Fault-Tolerant Global Load Balancing in X10
Author
Bungart, Marco ; Fohry, Claudia ; Posner, Jonas
Author_Institution
Res. Group Program. Languages/Methodologies, Univ. of Kassel, Kassel, Germany
fYear
2014
fDate
22-25 Sept. 2014
Firstpage
471
Lastpage
478
Abstract
Scalability postulates fault tolerance to be effective. We consider a user-level fault tolerance technique to cope with permanent node failures. It is supported by X10, one of the major Partitioned Global Address Space (PGAS) languages. In Resilient X10, an exception is thrown when a place (node) fails. This paper investigates task pools, which are often used by irregular applications to balance their load. We consider global load balancing with one worker per place. Each worker maintains a private task pool and supports cooperative work stealing. Tasks may generate new tasks dynamically, are free of side-effects, and their results are combined by reduction. Our first contribution is a task pool algorithm that can handle permanent place failures. It is based on snapshots that are regularly written to other workers and are updated in the event of stealing. Second, we implemented the algorithm in the Global Load Balancing framework GLB, which is part of the standard library of X10. We ran experiments with the Unbalanced Tree Search (UTS) and Betweenness Centrality (BC) benchmarks. With 64 places on 4 nodes, for instance, we observed an overhead of about 4% for using fault-tolerant GLB instead of GLB. The protocol overhead for a place failure was negligible.
Keywords
fault tolerant computing; parallel programming; resource allocation; software libraries; BC benchmark; PGAS languages; Resilient X10; UTS benchmark; betweenness centrality benchmark; cooperative work stealing; dynamic task generation; fault-tolerant GLB; fault-tolerant global load balancing; irregular applications; partitioned global address space; permanent node failures; place failure; private task pool algorithm; protocol overhead; standard library; unbalanced tree search benchmark; user-level fault tolerance technique; Data structures; Electronics packaging; Fault tolerance; Fault tolerant systems; Load management; Protocols; Registers; GLB; Resilient X10; algorithmic resilience; task pool;
fLanguage
English
Publisher
IEEE
Conference_Titel
Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2014 16th International Symposium on
Conference_Location
Timisoara
Print_ISBN
978-1-4799-8447-3
Type
conf
DOI
10.1109/SYNASC.2014.69
Filename
7034719