DocumentCode :
3591141
Title :
Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters
Author :
Rong Shi ; Sreeram Potluri ; Khaled Hamidouche ; Jonathan Perkins ; Mingzhe Li ; Davide Rossetti ; Dhabaleswar K. Panda
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
fYear :
2014
Firstpage :
1
Lastpage :
10
Abstract :
An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement on GPU clusters continues to be the major bottleneck that keeps scientific applications from fully harnessing the potential of GPUs. Earlier, GPU-GPU inter-node communication had to move data from GPU memory to host memory before sending it over the network. MPI libraries such as MVAPICH2 have provided solutions to alleviate this bottleneck using host-based pipelining techniques. In addition, the newly introduced GPUDirect RDMA (GDR) is a promising solution to further address this data movement bottleneck. However, existing designs in MPI libraries apply the rendezvous protocol for all message sizes, which incurs considerable overhead for small message communication due to extra synchronization message exchanges. In this paper, we propose new techniques to optimize inter-node GPU-to-GPU communication for small message sizes. Our designs to support the eager protocol include efficient support at both the sender and receiver sides. Furthermore, we propose a new data path that provides fast copies between host and GPU memories. To the best of our knowledge, this is the first study to propose efficient designs for GPU communication of small messages using the eager protocol. Our experimental results demonstrate up to 59% and 63% reduction in latency for GPU-to-GPU and CPU-to-GPU point-to-point communications, respectively. These designs boost the uni-directional bandwidth by 7.3x and 1.7x, respectively. We also evaluate our proposed designs with two end applications: GPULBM and HOOMD-blue. Performance numbers on Kepler GPUs show that, compared to the best existing GDR design, our proposed designs achieve up to 23.4% latency reduction for GPULBM and a 58% increase in average TPS for HOOMD-blue, respectively.
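The abstract contrasts rendezvous handling (extra handshake messages) with an eager path for small GPU-resident messages staged through host memory. The sketch below is only an illustration of that general idea, not the paper's MVAPICH2-internal design: it copies a small payload from GPU memory into a pinned host staging buffer and sends it with a single eager MPI_Send, with the receiver copying it back into GPU memory. The buffer names and the 1 KB message size are assumptions chosen for the example.

/*
 * Hypothetical sketch (not the authors' MVAPICH2 code): host-staged,
 * eager-style transfer of a small GPU-resident message between two ranks.
 * Run with two MPI ranks, each with access to a CUDA device.
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define MSG_BYTES 1024            /* "small" message size (assumption) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *gpu_buf;                 /* payload lives in GPU memory        */
    char *host_staging;            /* pinned host buffer for fast copies */
    cudaMalloc((void **)&gpu_buf, MSG_BYTES);
    cudaMallocHost((void **)&host_staging, MSG_BYTES);

    if (rank == 0) {
        cudaMemset(gpu_buf, 7, MSG_BYTES);
        /* Stage the small message through pinned host memory ...        */
        cudaMemcpy(host_staging, gpu_buf, MSG_BYTES, cudaMemcpyDeviceToHost);
        /* ... and send it eagerly: one message, no rendezvous handshake. */
        MPI_Send(host_staging, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(host_staging, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* Deliver the received payload into GPU memory at the receiver.  */
        cudaMemcpy(gpu_buf, host_staging, MSG_BYTES, cudaMemcpyHostToDevice);
        printf("rank 1 received %d bytes into GPU memory\n", MSG_BYTES);
    }

    cudaFreeHost(host_staging);
    cudaFree(gpu_buf);
    MPI_Finalize();
    return 0;
}

In a rendezvous protocol the same transfer would cost an additional request/acknowledge exchange before the data moves, which is the per-message overhead the paper's eager designs aim to eliminate for small sizes.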
Keywords :
application program interfaces; graphics processing units; message passing; parallel architectures; pipeline processing; protocols; CPU-to-GPU point-to-point communication; GDR design; GPU direct RDMA; GPU memory; GPU-GPU inter-node communication; GPU-to-GPU point-to-point communication; GPULBM; HOOMD-blue; InfiniBand GPU cluster; MPI application; MPI library; MVAPICH2; data movement; host memory; host-based pipelining technique; inter-node MPI communication; internode GPU-to-GPU communication; latency reduction; message communication; message size; message transfer mechanism; rendezvous protocol; synchronization message exchange; uni-directional bandwidth; Bandwidth; Benchmark testing; Graphics processing units; Libraries; Performance evaluation; Protocols; Receivers; CUDA; GPU Direct RDMA; InfiniBand; MPI;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
2014 21st International Conference on High Performance Computing (HiPC)
Print_ISBN :
978-1-4799-5975-4
Type :
conf
DOI :
10.1109/HiPC.2014.7116873
Filename :
7116873