Title :
GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation
Author :
Wang, Hao ; Potluri, Sreeram ; Bureddy, D. ; Rosales, Carlos ; Panda, Dhabaleswar K.
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
Abstract :
Designing high-performance and scalable applications on GPU clusters requires tackling several challenges. The key challenge is that host memory and device memory are separate, which requires programmers to use multiple programming models, such as CUDA and MPI, to operate on data in different memory spaces. This challenge becomes more difficult to tackle when real-world applications use non-contiguous data in multidimensional structures. These challenges limit both programming productivity and application performance. We propose GPU-Aware MPI to support data communication from GPU to GPU using standard MPI. It unifies the separate memory spaces and avoids explicit CPU-GPU data movement and CPU/GPU buffer management. It supports all MPI datatypes on device memory with two algorithms: a GPU datatype vectorization algorithm and a vector-based GPU kernel data pack and unpack algorithm. A pipeline is designed to overlap non-contiguous data packing and unpacking on GPUs, data movement over PCIe, and RDMA data transfer on the network. We integrate our design into the open-source MPI library MVAPICH2 and optimize a production application: the multiphase 3D LBM. In addition to improved programming productivity, we observe up to 19.9 percent improvement in application-level performance on 64 GPUs of the Oakley supercomputer.
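To make the vector-based pack and unpack idea concrete, the following is a minimal CPU-side sketch of what packing a strided (vector) layout into a contiguous buffer means. It is not the paper's GPU kernel or MVAPICH2 code; it only illustrates the semantics. The parameters `count`, `blocklength` and `stride` mirror those of `MPI_Type_vector`, while the function names and the `offset` argument are hypothetical and chosen for illustration.

```python
def pack_vector(buf, count, blocklength, stride, offset=0):
    """Gather `count` blocks of `blocklength` elements, spaced `stride`
    elements apart, into one contiguous list (illustrative only)."""
    packed = []
    for i in range(count):
        start = offset + i * stride
        packed.extend(buf[start:start + blocklength])
    return packed


def unpack_vector(packed, buf, count, blocklength, stride, offset=0):
    """Scatter a contiguous list back into the strided layout of `buf`."""
    for i in range(count):
        start = offset + i * stride
        buf[start:start + blocklength] = packed[i * blocklength:(i + 1) * blocklength]


# A 4x4 row-major grid: one column is a vector with
# count=4, blocklength=1, stride=4.
grid = list(range(16))
column = pack_vector(grid, count=4, blocklength=1, stride=4, offset=1)

# Receiving side: place the packed column into an empty grid.
target = [0] * 16
unpack_vector(column, target, count=4, blocklength=1, stride=4, offset=1)
```

In the design the abstract describes, the equivalent gather and scatter run as GPU kernels on device memory, and the packed chunks are pipelined across PCIe and the RDMA network rather than copied in one step.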
Keywords :
application program interfaces; file organisation; graphics processing units; libraries; mainframes; message passing; parallel machines; peripheral interfaces; public domain software; GPU clusters; GPU datatype vectorization algorithm; GPU-aware MPI; Oakley supercomputer; PCIe; RDMA data transfer; RDMA-enabled clusters; application-level performance; data communication; device memory; host memory; multidimensional structures; multiphase 3D LBM; noncontiguous data packing; noncontiguous data unpacking; open-source MPI library MVAPICH2; vector based GPU kernel data pack algorithm; vector based GPU kernel data unpack algorithm; Algorithm design and analysis; Data communication; Graphics processing units; Kernel; Memory management; Pipelines; Vectors; CUDA; GPU; InfiniBand; Lattice Boltzmann method; MPI; RDMA;
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
DOI :
10.1109/TPDS.2013.222