DocumentCode :
1925582
Title :
Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments
Author :
Jenkins, John ; Dinan, James ; Balaji, Pavan ; Samatova, Nagiza F. ; Thakur, Rajeev
Author_Institution :
Department of Comput. Sci., North Carolina State Univ., Raleigh, NC, USA
fYear :
2012
fDate :
24-28 Sept. 2012
Firstpage :
468
Lastpage :
476
Abstract :
Lack of efficient and transparent interaction with GPU data in hybrid MPI+GPU environments challenges GPU acceleration of large-scale scientific computations. A particular challenge is the transfer of noncontiguous data to and from GPU memory. MPI implementations currently do not provide an efficient means of using datatypes for noncontiguous communication of data in GPU memory. To address this gap, we present an MPI datatype-processing system capable of efficiently processing arbitrary datatypes directly on the GPU. We present a means for converting conventional datatype representations into a GPU-amenable format. A GPU kernel then exploits fine-grained, element-level parallelism to perform in-device packing and unpacking of noncontiguous elements. We demonstrate a several-fold performance improvement for noncontiguous column vectors, 3D array slices, and 4D array subvolumes over CUDA-based alternatives. Compared with optimized, layout-specific implementations, our approach incurs low overhead while enabling the packing of datatypes that have no direct CUDA equivalent. These gains translate into significant reductions in end-to-end, GPU-to-GPU communication time. In addition, we identify and evaluate communication patterns that may cause resource contention with packing operations, providing a baseline for adaptively selecting data-processing strategies.
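The paper's own datatype-processing system is not reproduced here; as a rough illustration of the idea of in-device, element-level parallel packing of a noncontiguous (vector-like) layout, the following minimal CUDA sketch gathers one strided column of a row-major array into a contiguous staging buffer with one thread per element. All names (pack_column, d_src, d_pack, and the array dimensions) are hypothetical and not taken from the paper.

// Minimal sketch, assuming a row-major rows x cols double array in GPU memory
// and an MPI_Type_vector-like column layout to be packed before communication.
#include <cuda_runtime.h>

__global__ void pack_column(const double *src, double *dst,
                            int rows, int row_stride, int col)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < rows)
        dst[i] = src[i * row_stride + col];          // gather strided element
}

int main(void)
{
    const int rows = 1024, cols = 1024, col = 7;     // hypothetical sizes
    double *d_src, *d_pack;
    cudaMalloc(&d_src,  rows * cols * sizeof(double));
    cudaMalloc(&d_pack, rows * sizeof(double));
    cudaMemset(d_src, 0, rows * cols * sizeof(double));

    int threads = 256;
    int blocks  = (rows + threads - 1) / threads;
    pack_column<<<blocks, threads>>>(d_src, d_pack, rows, cols, col);
    cudaDeviceSynchronize();

    // d_pack now holds the column contiguously; in an MPI+GPU setting it could
    // be handed to the communication layer (or staged to host memory) for send.
    cudaFree(d_src);
    cudaFree(d_pack);
    return 0;
}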
Keywords :
graphics processing units; message passing; parallel architectures; scientific information systems; vectors; 3D array slices; 4D array subvolumes; CUDA; GPU acceleration; GPU kernel; GPU memory; GPU-amenable format; MPI datatype-processing system; arbitrary datatype processing; communication pattern evaluation; communication pattern identification; data-processing strategies; datatype packing; datatype representations; end-to-end GPU-to-GPU communication time; fast noncontiguous GPU data movement; fine-grained element-level parallelism; graphics processing units; hybrid MPI-plus-GPU environments; in-device packing; in-device unpacking; large-scale scientific computations; noncontiguous column vectors; noncontiguous data communication; noncontiguous data transfer; overhead; performance improvement; resource contention; Arrays; Encoding; Graphics processing unit; Instruction sets; Kernel; Parallel processing; Vectors; Datatype; GPU; MPI
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cluster Computing (CLUSTER), 2012 IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4673-2422-9
Type :
conf
DOI :
10.1109/CLUSTER.2012.72
Filename :
6337810