DocumentCode :
3678339
Title :
Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters
Author :
Khaled Hamidouche;Akshay Venkatesh;Ammar Ahmad Awan;Hari Subramoni;Ching-Hsiang Chu;Dhabaleswar K. Panda
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
fYear :
2015
Firstpage :
78
Lastpage :
87
Abstract :
GPUDirect RDMA (GDR) brings the high-performance communication capabilities of RDMA networks like InfiniBand (IB) to GPUs (referred to as "Device"). It enables IB network adapters to directly write/read data to/from GPU memory. Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, provide an attractive approach for developing scientific applications with irregular communication characteristics by providing shared memory address space abstractions along with one-sided communication semantics. However, current approaches and designs of OpenSHMEM on GPU clusters do not take advantage of GDR features, leading to inefficiencies and sub-optimal performance. In this paper, we analyze the performance of various OpenSHMEM operations with different inter-node and intra-node communication configurations (Host-to-Device, Device-to-Device, and Device-to-Host) on GPU-based systems. We propose novel designs that ensure "truly one-sided" communication for the different inter-/intra-node configurations identified above while working around hardware limitations. To the best of our knowledge, this is the first work that investigates GDR-aware designs for OpenSHMEM communication operations. Experimental evaluations indicate 2.5X and 7X improvements in point-to-point communication for intra-node and inter-node transfers, respectively. The proposed framework achieves 2.2 μs for an intra-node 8-byte put operation from Host to Device, and 3.13 μs for an inter-node 8-byte put operation from a GPU to a remote GPU. With the Stencil2D application kernel from the SHOC benchmark suite, we observe a 19% reduction in execution time on 64 GPU nodes. Further, for the GPULBM application, we are able to improve the performance of the evolution phase by 53% and 45% on 32 and 64 GPU nodes, respectively.
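For context, the "truly one-sided" put described in the abstract is expressed at the application level through standard OpenSHMEM calls; below is a minimal sketch (not taken from the paper), assuming a GDR-aware OpenSHMEM runtime in which the symmetric buffer may be placed in GPU memory through an implementation-specific configuration, while the application code itself stays unchanged.

/* Minimal sketch, illustrative only: an 8-byte one-sided put between two
 * PEs using standard OpenSHMEM calls. Whether the symmetric buffer resides
 * in host or GPU (device) memory is a runtime configuration detail in a
 * GDR-aware stack; the calls below are unchanged either way. */
#include <shmem.h>

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric 8-byte buffer; with GDR, a put targeting this buffer on a
     * remote PE can be written by the NIC directly into GPU memory. */
    long *buf = (long *) shmem_malloc(sizeof(long));
    shmem_barrier_all();

    if (me == 0 && npes > 1) {
        long val = 42;
        shmem_putmem(buf, &val, sizeof(long), 1);  /* one-sided put to PE 1 */
        shmem_quiet();                             /* wait for remote completion */
    }
    shmem_barrier_all();

    shmem_free(buf);
    shmem_finalize();
    return 0;
}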
Keywords :
"Graphics processing units","Performance evaluation","Peer-to-peer computing","Runtime","Electronics packaging","Programming","Pipeline processing"
Publisher :
IEEE
Conference_Titel :
Cluster Computing (CLUSTER), 2015 IEEE International Conference on
Type :
conf
DOI :
10.1109/CLUSTER.2015.21
Filename :
7307568