DocumentCode :
1925582
Title :
Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments
Author :
Jenkins, John ; Dinan, James ; Balaji, Pavan ; Samatova, Nagiza F. ; Thakur, Rajeev
Author_Institution :
Department of Comput. Sci., North Carolina State Univ., Raleigh, NC, USA
fYear :
2012
fDate :
24-28 Sept. 2012
Firstpage :
468
Lastpage :
476
Abstract :
Lack of efficient and transparent interaction with GPU data in hybrid MPI+GPU environments challenges GPU acceleration of large-scale scientific computations. A particular challenge is the transfer of noncontiguous data to and from GPU memory. MPI implementations currently do not provide an efficient means of using datatypes for noncontiguous communication of data in GPU memory. To address this gap, we present an MPI datatype-processing system capable of efficiently processing arbitrary datatypes directly on the GPU. We present a means for converting conventional datatype representations into a GPU-amenable format. A GPU kernel then exploits fine-grained, element-level parallelism to perform in-device packing and unpacking of noncontiguous elements. We demonstrate a several-fold performance improvement for noncontiguous column vectors, 3D array slices, and 4D array subvolumes over CUDA-based alternatives. Compared with optimized, layout-specific implementations, our approach incurs low overhead while enabling the packing of datatypes that have no direct CUDA equivalent. These gains translate into significant reductions in end-to-end, GPU-to-GPU communication time. In addition, we identify and evaluate communication patterns that may cause resource contention with packing operations, providing a baseline for adaptively selecting data-processing strategies.
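The paper's own datatype-processing system is not reproduced here; as a rough illustration of the idea of in-device, element-level parallel packing of a noncontiguous (vector-like) layout, the following minimal CUDA sketch gathers one strided column of a row-major array into a contiguous staging buffer with one thread per element. All names (pack_column, d_src, d_pack, and the array dimensions) are hypothetical and not taken from the paper.

// Minimal sketch, assuming a row-major rows x cols double array in GPU memory
// and an MPI_Type_vector-like column layout to be packed before communication.
#include <cuda_runtime.h>

__global__ void pack_column(const double *src, double *dst,
                            int rows, int row_stride, int col)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < rows)
        dst[i] = src[i * row_stride + col];          // gather strided element
}

int main(void)
{
    const int rows = 1024, cols = 1024, col = 7;     // hypothetical sizes
    double *d_src, *d_pack;
    cudaMalloc(&d_src,  rows * cols * sizeof(double));
    cudaMalloc(&d_pack, rows * sizeof(double));
    cudaMemset(d_src, 0, rows * cols * sizeof(double));

    int threads = 256;
    int blocks  = (rows + threads - 1) / threads;
    pack_column<<<blocks, threads>>>(d_src, d_pack, rows, cols, col);
    cudaDeviceSynchronize();

    // d_pack now holds the column contiguously; in an MPI+GPU setting it could
    // be handed to the communication layer (or staged to host memory) for send.
    cudaFree(d_src);
    cudaFree(d_pack);
    return 0;
}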
Keywords :
graphics processing units; message passing; parallel architectures; scientific information systems; vectors; 3D array slices; 4D array subvolumes; CUDA; GPU acceleration; GPU kernel; GPU memory; GPU-amenable format; MPI datatype-processing system; arbitrary datatype processing; communication pattern evaluation; communication pattern identification; data-processing strategies; datatype packing; datatype representations; end-to-end GPU-to-GPU communication time; fast noncontiguous GPU data movement; fine-grained element-level parallelism; graphics processing units; hybrid MPI-plus-GPU environments; in-device packing; in-device unpacking; large-scale scientific computations; noncontiguous column vectors; noncontiguous data communication; noncontiguous data transfer; overhead; performance improvement; resource contention; Arrays; Encoding; Graphics processing unit; Instruction sets; Kernel; Parallel processing; Vectors; Datatype; GPU; MPI
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cluster Computing (CLUSTER), 2012 IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4673-2422-9
Type :
conf
DOI :
10.1109/CLUSTER.2012.72
Filename :
6337810