• DocumentCode
    625677
  • Title
    XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures
  • Author
    Gautier, Thierry; Lima, Joao V. F.; Maillard, Nicolas; Raffin, Bruno
  • Author_Institution
    INRIA, Grenoble, France
  • fYear
    2013
  • fDate
    20-24 May 2013
  • Firstpage
    1299
  • Lastpage
    1308
  • Abstract
    Most recent HPC platforms have heterogeneous nodes composed of multi-core CPUs and accelerators, like GPUs. Programming such nodes is typically based on a combination of OpenMP and CUDA/OpenCL codes; scheduling relies on a static partitioning and cost model. We present the XKaapi runtime system for data-flow task programming on multi-CPU and multi-GPU architectures, which supports a data-flow task model and a locality-aware work stealing scheduler. XKaapi enables task multi-implementation on CPU or GPU and multi-level parallelism with different grain sizes. We show performance results on two dense linear algebra kernels, matrix product (GEMM) and Cholesky factorization (POTRF), to evaluate XKaapi on a heterogeneous architecture composed of two hexa-core CPUs and eight NVIDIA Fermi GPUs. Our conclusion is two-fold. First, fine grained parallelism and online scheduling achieve performance results as good as static strategies, and in most cases outperform them. This is due to an improved work stealing strategy that includes locality information; a very light implementation of the tasks in XKaapi; and an optimized search for ready tasks. Next, the multi-level parallelism on multiple CPUs and GPUs enabled by XKaapi led to a highly efficient Cholesky factorization. Using eight NVIDIA Fermi GPUs and four CPUs, we measure up to 2.43 TFlop/s on double precision matrix product and 1.79 TFlop/s on Cholesky factorization; and respectively 5.09 TFlop/s and 3.92 TFlop/s in single precision.
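    The abstract refers to a data-flow task model in which dependencies are inferred from the access modes (read, write) that tasks declare on shared data, and to a runtime that dispatches ready tasks to CPU and GPU workers. The sketch below is a minimal, hypothetical illustration of that idea in C++; it is not the XKaapi API (the Task, Graph, spawn and Mode names are invented for illustration), and it executes tasks sequentially instead of scheduling them across heterogeneous workers.

```cpp
// Illustrative sketch only (not the XKaapi API): tasks declare read/write
// access to data blocks, and data-flow dependencies are derived from the
// last writer of each block, as in a data-flow task model.
#include <cstdio>
#include <functional>
#include <map>
#include <utility>
#include <vector>

enum class Mode { Read, Write, ReadWrite };

struct Task {
    const char* name;
    std::vector<std::pair<void*, Mode>> accesses;  // declared data accesses
    std::function<void()> body;
    std::vector<int> deps;                         // predecessor task indices
};

struct Graph {
    std::vector<Task> tasks;
    std::map<void*, int> last_writer;              // data block -> task index

    // Spawning a task records dependencies: a task depends on the last
    // writer of every block it accesses, and becomes the last writer of
    // every block it modifies.
    void spawn(Task t) {
        for (auto& [ptr, mode] : t.accesses) {
            auto it = last_writer.find(ptr);
            if (it != last_writer.end()) t.deps.push_back(it->second);
            if (mode != Mode::Read) last_writer[ptr] = (int)tasks.size();
        }
        tasks.push_back(std::move(t));
    }

    // Toy runner: execute in spawn order, which already respects the
    // recorded dependencies. A real runtime would hand ready tasks to
    // per-CPU/per-GPU workers and steal with locality awareness.
    void run() {
        for (auto& t : tasks) {
            std::printf("%s (deps:", t.name);
            for (int d : t.deps) std::printf(" %d", d);
            std::printf(")\n");
            t.body();
        }
    }
};

int main() {
    double a = 1.0, b = 2.0, c = 0.0;
    Graph g;
    g.spawn({"init_c", {{&c, Mode::Write}}, [&] { c = 0.0; }});
    g.spawn({"gemm",
             {{&a, Mode::Read}, {&b, Mode::Read}, {&c, Mode::ReadWrite}},
             [&] { c += a * b; }});
    g.spawn({"print", {{&c, Mode::Read}}, [&] { std::printf("c=%g\n", c); }});
    g.run();
}
```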
  • Keywords
    data flow computing; graphics processing units; linear algebra; matrix decomposition; multiprocessing systems; optimisation; parallel architectures; processor scheduling; search problems; task analysis; CUDA; Cholesky factorization; Fermi GPU; HPC; NVIDIA; OpenCL code; OpenMP; XKaapi runtime system; accelerator; cost model; data flow task programming; dense linear algebra kernel; fine grained parallelism; grain size; heterogeneous architecture; heterogeneous node; locality aware work stealing scheduling; matrix product; multiGPU architecture; multicore CPU; multilevel parallelism; online scheduling; search optimization; static partitioning; static strategy; Data transfer; Graphics processing units; Instruction sets; Kernel; Parallel processing; Programming; Runtime; Data-Flow task model; Dense Linear Algebra; Heterogeneous architectures; High Performance Computing; Locality Aware Work Stealing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on
  • Conference_Location
    Boston, MA
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4673-6066-1
  • Type
    conf
  • DOI
    10.1109/IPDPS.2013.66
  • Filename
    6569905