Optimized MPI Gather Collective for Many Integrated Core (MIC) InfiniBand Clusters

Author

Venkatesh, Akshay ; Kandalla, Krishna ; Panda, Dhabaleswar K.

Author_Institution

Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA

fYear

2013

fDate

15-16 Aug. 2013

Firstpage

58

Lastpage

63

Abstract

Xeon Phi coprocessors are gaining popularity in the high performance computing community because they provide a highly parallel environment and x86 compatibility. The coprocessors, which conform to Intel's Many Integrated Core (MIC) architecture, are also being deployed at large scale because they yield high performance per watt. Each Xeon Phi coprocessor, despite offering 1 teraflop of performance, is attached to the host system as a PCIe device and hence experiences the accompanying bandwidth and latency degradations. MPI libraries need to be designed in an architecture-aware manner and must leverage the software stacks available on the MIC to minimize the time spent in communication. Along with the optimization of send-receive MPI primitives, collectives, which are widely used by scientific applications, need to be designed at the algorithm level in a way that alleviates architectural bottlenecks. In this work, we propose novel algorithms based on hierarchical communication designs and pipelining techniques to improve the performance of the MPI_Gather collective. At the micro-benchmark level, for a 256-process MPI job with the root of the gather on the MIC, the proposed algorithms reduce the average MPI_Gather latency by up to 83% and 87% compared to the existing MVAPICH2 and Intel MPI implementations of the operation, respectively.
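
The abstract does not spell out the proposed algorithms, so the following is only a rough illustration of the general idea behind a hierarchical gather, not the authors' design. It is a plain-Python simulation that makes no MPI calls. The function names, the fixed `ranks_per_node` layout, and the choice of one leader per node are all assumptions made for this sketch, and pipelining is not modeled.

```python
def flat_gather(chunks):
    """Reference result: the root receives every rank's chunk, in rank order."""
    return [x for chunk in chunks for x in chunk]


def hierarchical_gather(chunks, ranks_per_node):
    """Simulated two-level gather (hypothetical helper, not an MPI API).

    Stage 1: on each node, a leader collects the chunks of its local ranks.
    Stage 2: the root collects one aggregated buffer per node.
    Returns the gathered data and the number of aggregated buffers.
    """
    node_buffers = []
    for start in range(0, len(chunks), ranks_per_node):
        local = chunks[start:start + ranks_per_node]
        # Leader concatenates local chunks in rank order.
        node_buffers.append([x for chunk in local for x in chunk])
    # Root concatenates the per-node buffers in node order.
    gathered = [x for buf in node_buffers for x in buf]
    return gathered, len(node_buffers)


# Eight simulated ranks, two elements each, four ranks per node (assumed layout).
chunks = [[rank * 10, rank * 10 + 1] for rank in range(8)]
result, buffers_to_root = hierarchical_gather(chunks, ranks_per_node=4)

# The hierarchical result matches the flat gather, using fewer buffers at the root.
assert result == flat_gather(chunks)
print(buffers_to_root)
```

In this toy layout the root handles two aggregated buffers instead of eight per-rank chunks, which is the kind of reduction in traffic over a slow link that a hierarchical design aims for.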

Keywords

coprocessors; message passing; multiprocessing systems; parallel architectures; peripheral interfaces; pipeline processing; power aware computing; MIC architecture; MPI Gather collective; MPI Gather latency; MPI job processing; MPI library; PCIe device; X86; Xeon Phi coprocessor; bandwidth degradation; hierarchical communication algorithm; high performance computing; latency degradation; many integrated core infiniband cluster; micro-benchmark level; parallel environment; pipelining technique; scientific applications; send-receive MPI primitive optimization; software stacks; Algorithm design and analysis; Computer architecture; Coprocessors; Libraries; Microwave integrated circuits; Performance evaluation; Program processors;

fLanguage

English

Publisher

IEEE

Conference_Titel

Extreme Scaling Workshop (XSW), 2013

Conference_Location

Boulder, CO

Type

conf

DOI

10.1109/XSW.2013.12

Filename

6805043