DocumentCode :
2050605
Title :
MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits
Author :
Singh, Ashish Kumar ; Potluri, Sreeram ; Wang, Hao ; Kandalla, Krishna ; Sur, Sayantan ; Panda, Dhabaleswar K.
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
fYear :
2011
fDate :
26-30 Sept. 2011
Firstpage :
420
Lastpage :
427
Abstract :
General Purpose Graphics Processing Units (GPGPUs) are rapidly becoming an integral part of high-performance system architectures. The Tianhe-1A and Tsubame systems received significant attention for their architectures that leverage GPGPUs. Increasingly, many scientific applications that were originally written for CPUs using MPI for parallelism are being ported to these hybrid CPU-GPU clusters. In the traditional sense, CPUs perform computation while the MPI library takes care of communication. When computation is performed on GPGPUs, the data has to be moved from device memory to main memory before it can be used in communication. Though GPGPUs provide huge compute potential, the data movement to and from GPGPUs is both a performance and a productivity bottleneck. Recently, the MVAPICH2 MPI library has been modified to directly support point-to-point MPI communication from GPU memory [1]. Using this support, programmers do not need to explicitly move data to main memory before using MPI. This feature also enables performance improvements through tight integration of GPU data movement with MPI internal protocols. Typically, scientific applications spend a significant portion of their execution time in collective communication. Hence, optimizing the performance of collectives has a significant impact on overall application performance. MPI_Alltoall is a heavily used collective that requires O(N^2) communication for N processes. In this paper, we outline the major design alternatives for the MPI_Alltoall collective communication operation on GPGPU clusters. We propose three design alternatives and provide a corresponding performance analysis. Using our dynamic staging techniques, the latency of MPI_Alltoall on GPU clusters can be improved by 44% over a user-level implementation and by 31% over a send-recv based implementation for 256 KByte messages on 8 processes.
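To illustrate the two data-movement paths the abstract contrasts, the following is a minimal C/MPI/CUDA sketch, not taken from the paper: the message size, variable names, and the assumption of a CUDA-aware MPI build (e.g., MVAPICH2 with GPU support) are illustrative. Part (a) shows the conventional user-level pattern of staging GPU data through host memory around MPI_Alltoall; part (b) passes the device pointers directly, letting the library handle and overlap the staging internally.

/* Illustrative sketch (not from the paper): host-staged vs. direct
 * GPU-buffer MPI_Alltoall. Sizes and names are assumptions. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const size_t msg   = 256 * 1024;            /* 256 KByte per peer, as in the reported results */
    const size_t total = msg * (size_t)nprocs;

    char *d_send, *d_recv;                      /* device (GPU) buffers holding the computed data */
    cudaMalloc((void **)&d_send, total);
    cudaMalloc((void **)&d_recv, total);

    /* (a) Conventional approach: explicitly stage through main memory. */
    char *h_send = malloc(total);
    char *h_recv = malloc(total);
    cudaMemcpy(h_send, d_send, total, cudaMemcpyDeviceToHost);
    MPI_Alltoall(h_send, (int)msg, MPI_CHAR,
                 h_recv, (int)msg, MPI_CHAR, MPI_COMM_WORLD);
    cudaMemcpy(d_recv, h_recv, total, cudaMemcpyHostToDevice);

    /* (b) With a CUDA-aware MPI library, device pointers can be passed
     * directly; the library integrates the device-host data movement
     * with its internal communication protocols. */
    MPI_Alltoall(d_send, (int)msg, MPI_CHAR,
                 d_recv, (int)msg, MPI_CHAR, MPI_COMM_WORLD);

    free(h_send);
    free(h_recv);
    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}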
Keywords :
coprocessors; message passing; software libraries; storage management; CPU-GPU clusters; GPGPU clusters; GPU memory; MPI alltoall personalized exchange; MPI internal protocols; MPI_Alltoall collective communication operation; MVAPICH2 MPI library; Tianhe-1A; Tsubame systems; device memory; dynamic staging techniques; general purpose graphics processing units; high-performance system architectures; main memory; point-to-point MPI communication; scientific applications; send-recv based implementation; Algorithm design and analysis; Clustering algorithms; Computer architecture; Graphics processing unit; Libraries; Peer to peer computing; Performance evaluation; Clusters; Collectives; GPGPU; Infiniband; MPI;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Cluster Computing (CLUSTER), 2011 IEEE International Conference on
Conference_Location :
Austin, TX
Print_ISBN :
978-1-4577-1355-2
Electronic_ISBN :
978-0-7695-4516-5
Type :
conf
DOI :
10.1109/CLUSTER.2011.67
Filename :
6061073