DocumentCode :
611018
Title :
Efficient Intra-node Communication on Intel-MIC Clusters
Author :
Potluri, Sreeram ; Venkatesh, Akshay ; Bureddy, D. ; Kandalla, Krishna ; Panda, Dhabaleswar K.
fYear :
2013
fDate :
13-16 May 2013
Firstpage :
128
Lastpage :
135
Abstract :
Accelerators and coprocessors have become a key component in modern supercomputing systems due to the superior performance per watt that they offer. Intel's Xeon Phi coprocessor packs up to 1 TFLOP of double precision performance in a single chip while providing x86 compatibility and supporting popular programming models like MPI and OpenMP. This makes it an attractive choice for accelerating HPC applications. The Xeon Phi provides several channels for communication between MPI processes running on the coprocessor and the host. While supporting POSIX shared memory within the coprocessor, it exposes a low-level API called the Symmetric Communication Interface (SCIF) that gives the user direct control of the DMA engine. SCIF can also be used for communication between the coprocessor and the host. The Xeon Phi also provides an implementation of the InfiniBand (IB) Verbs interface that enables a direct communication link with the InfiniBand adapter for communication between the coprocessor and the host. In this paper, we propose and evaluate design alternatives for efficient communication on a node with a Xeon Phi coprocessor. We incorporate our designs in the popular MVAPICH2 MPI library. We use shared memory, IB Verbs and SCIF to design a hybrid solution that improves the MPI communication latency from the Xeon Phi to the host by 70% for 4 MB messages, compared to an out-of-the-box version of MVAPICH2. Our solution delivers more than 6x improvement in peak uni-directional bandwidth from the Xeon Phi to the host and more than 3x improvement in bi-directional bandwidth. Through our designs, we are able to improve the performance of 16-process Gather, Alltoall and Allgather collective operations by 70%, 85% and 80%, respectively, for 4 MB messages. We further evaluate our designs using application benchmarks and show improvements of up to 18% with a 3D Stencil kernel and up to 11.5% with the P3DFFT library.
Keywords :
coprocessors; message passing; parallel machines; peripheral interfaces; shared memory systems; DMA engine; IB; InfiniBand Verbs interface; Intel Xeon Phi coprocessor; Intel-MIC clusters; MVAPICH2 MPI library; OpenMP; P3DFFT library; POSIX shared memory; double precision performance; efficient intranode communication; modern supercomputing systems; peak unidirectional bandwidth; single chip; symmetric communication interface; x86 compatibility; Bandwidth; Communication channels; Kernel; Libraries; Peer-to-peer computing; Clusters; InfiniBand; Intra-node Communication; MPI; SCIF; Xeon Phi;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
2013 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
Conference_Location :
Delft, Netherlands
Print_ISBN :
978-1-4673-6465-2
Type :
conf
DOI :
10.1109/CCGrid.2013.86
Filename :
6546070