• DocumentCode
    262127
  • Title

    Fault-Tolerant Global Load Balancing in X10

  • Author

    Bungart, Marco ; Fohry, Claudia ; Posner, Jonas

  • Author_Institution
    Res. Group Program. Languages/Methodologies, Univ. of Kassel, Kassel, Germany
  • fYear
    2014
  • fDate
    22-25 Sept. 2014
  • Firstpage
    471
  • Lastpage
    478
  • Abstract
    Scalability postulates fault tolerance to be effective. We consider a user-level fault tolerance technique to cope with permanent node failures. It is supported by X10, one of the major Partitioned Global Address Space (PGAS) languages. In Resilient X10, an exception is thrown when a place (node) fails. This paper investigates task pools, which are often used by irregular applications to balance their load. We consider global load balancing with one worker per place. Each worker maintains a private task pool and supports cooperative work stealing. Tasks may generate new tasks dynamically, are free of side-effects, and their results are combined by reduction. Our first contribution is a task pool algorithm that can handle permanent place failures. It is based on snapshots that are regularly written to other workers and are updated in the event of stealing. Second, we implemented the algorithm in the Global Load Balancing framework GLB, which is part of the standard library of X10. We ran experiments with the Unbalanced Tree Search (UTS) and Betweenness Centrality (BC) benchmarks. With 64 places on 4 nodes, for instance, we observed an overhead of about 4% for using fault-tolerant GLB instead of GLB. The protocol overhead for a place failure was negligible.
  • Keywords
    fault tolerant computing; parallel programming; resource allocation; software libraries; BC benchmark; PGAS languages; Resilient X10; UTS benchmark; betweenness centrality benchmark; cooperative work stealing; dynamic task generation; fault-tolerant GLB; fault-tolerant global load balancing; irregular applications; partitioned global address space; permanent node failures; place failure; private task pool algorithm; protocol overhead; standard library; unbalanced tree search benchmark; user-level fault tolerance technique; Data structures; Electronics packaging; Fault tolerance; Fault tolerant systems; Load management; Protocols; Registers; GLB; algorithmic resilience; task pool
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2014 16th International Symposium on
  • Conference_Location
    Timisoara
  • Print_ISBN
    978-1-4799-8447-3
  • Type

    conf

  • DOI
    10.1109/SYNASC.2014.69
  • Filename
    7034719