DocumentCode :
2052659
Title :
Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2
Author :
Wang, Hao ; Potluri, Sreeram ; Luo, Miao ; Singh, Ashish Kumar ; Ouyang, Xiangyong ; Sur, Sayantan ; Panda, Dhabaleswar K.
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
fYear :
2011
fDate :
26-30 Sept. 2011
Firstpage :
308
Lastpage :
316
Abstract :
Data-parallel architectures, such as General Purpose Graphics Processing Units (GPGPUs), have seen a tremendous rise in their application for High End Computing. However, data movement in and out of GPGPUs remains the biggest hurdle to overall performance and programmer productivity. Real scientific applications utilize multi-dimensional data, and data in higher dimensions may not be contiguous in memory. In order to improve programmer productivity and to enable communication libraries to optimize non-contiguous data communication, the MPI interface provides MPI datatypes. Currently, state-of-the-art MPI libraries do not provide native datatype support for data that resides in GPU memory. The management of non-contiguous GPU data is a source of productivity and performance loss, because GPU application developers have to manually move the data out of and into GPUs. In this paper, we present our design for enabling high-performance communication support between GPUs for non-contiguous datatypes. We describe our innovative approach to improve performance by "offloading" datatype packing and unpacking onto the GPU device, and "pipelining" all data transfer stages between two GPUs. Our design is integrated into the popular MVAPICH2 MPI library for InfiniBand, iWARP and RoCE clusters. We perform a detailed evaluation of our design on a GPU cluster with the latest NVIDIA Fermi GPU adapters. The evaluation reveals that the proposed designs can achieve up to 88% latency improvement for the vector datatype at 4 MB size with micro-benchmarks. For the Stencil2D application from the SHOC benchmark suite, our design can simplify the data communication in its main loop, reducing the lines of code by 36%. Further, our method can improve the performance of Stencil2D by up to 42% for the single precision data set and 39% for the double precision data set. To the best of our knowledge, this is the first such design, implementation and evaluation of non-contiguous MPI data communication for GPU clusters.
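To illustrate the usage model the abstract describes, the following minimal sketch (not taken from the paper) exchanges one non-contiguous column of a GPU-resident, row-major 2D grid using an MPI vector datatype, in the style of a Stencil2D halo exchange. With a CUDA-enabled MVAPICH2 build that supports GPU buffers (typically activated at run time via MV2_USE_CUDA=1), the device pointer can be passed directly to MPI_Send/MPI_Recv and the library handles packing, unpacking, and host-device staging internally. The grid dimensions, rank roles, and tag are illustrative assumptions.

```c
/* Sketch: send a non-contiguous column of a GPU-resident grid
 * using an MPI vector datatype (run with 2 ranks). */
#include <mpi.h>
#include <cuda_runtime.h>

#define NX 1024   /* rows    (assumed size) */
#define NY 1024   /* columns (assumed size) */

int main(int argc, char **argv)
{
    int rank;
    double *d_grid;          /* GPU-resident NX x NY grid, row-major */
    MPI_Datatype column;     /* one column: NX elements, stride NY  */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_grid, (size_t)NX * NY * sizeof(double));

    /* Describe the non-contiguous column layout once; the library
       can then optimize packing/unpacking of this datatype. */
    MPI_Type_vector(NX, 1, NY, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        /* rightmost column starts at offset NY-1 */
        MPI_Send(d_grid + NY - 1, 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_grid, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    cudaFree(d_grid);
    MPI_Finalize();
    return 0;
}
```

Without GPU-aware datatype support, the application would instead have to launch its own pack kernel, stage the column through a host buffer, and unpack on the receiver, which is the manual data movement the paper's design eliminates.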
Keywords :
coprocessors; data communication; GPU clusters; GPU memory; InfiniBand; MPI interface; MVAPICH2 MPI library; NVIDIA Fermi GPU adapters; RoCE clusters; data movement; data parallel architectures; data type packing; general purpose graphics processing units; high end computing; high performance communication support; iWARP; multidimensional data; noncontiguous GPU data; noncontiguous MPI data type communication; noncontiguous data types; programmer productivity; Computer architecture; Graphics processing unit; Libraries; Performance evaluation; Pipeline processing; Programming; Receivers; Cluster; GPGPU; MPI; Non-contiguous;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
2011 IEEE International Conference on Cluster Computing (CLUSTER)
Conference_Location :
Austin, TX
Print_ISBN :
978-1-4577-1355-2
Electronic_ISBN :
978-0-7695-4516-5
Type :
conf
DOI :
10.1109/CLUSTER.2011.42
Filename :
6061149