DocumentCode :
1987680
Title :
Optimized MPI Gather Collective for Many Integrated Core (MIC) InfiniBand Clusters
Author :
Venkatesh, Akshay ; Kandalla, Krishna ; Panda, Dhabaleswar K.
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
fYear :
2013
fDate :
15-16 Aug. 2013
Firstpage :
58
Lastpage :
63
Abstract :
Xeon Phi coprocessors are gaining popularity in the high performance computing community owing to the highly parallel environment they provide and their x86 compatibility. The coprocessors, which conform to Intel's Many Integrated Core (MIC) architecture, are also being deployed at large scale because they deliver high performance per watt. Each Xeon Phi coprocessor, despite offering 1 teraflop of performance, is attached to the host system as a PCIe device and hence experiences the accompanying bandwidth and latency degradations. MPI libraries need to be designed in an architecture-aware manner and must leverage the software stacks available on the MIC to minimize the time spent in communication. Along with the optimization of send-receive MPI primitives, collectives, which are widely used by scientific applications, need to be designed at the algorithm level in a way that alleviates architectural bottlenecks. In this work, we propose novel algorithms based on hierarchical communication designs and pipelining techniques to improve the performance of the MPI_Gather collective. At the micro-benchmark level, for a 256-process MPI job with the root of the gather on the MIC, the proposed algorithms reduce the average MPI_Gather latency by up to 83% and 87% compared to the existing MVAPICH2 and Intel MPI implementations of the operation, respectively.
Keywords :
coprocessors; message passing; multiprocessing systems; parallel architectures; peripheral interfaces; pipeline processing; power aware computing; MIC architecture; MPI Gather collective; MPI Gather latency; MPI job processing; MPI library; PCIe device; X86; Xeon Phi coprocessor; bandwidth degradation; hierarchical communication algorithm; high performance computing; latency degradation; many integrated core infiniband cluster; micro-benchmark level; parallel environment; pipelining technique; scientific applications; send-receive MPI primitive optimization; software stacks; Algorithm design and analysis; Computer architecture; Coprocessors; Libraries; Microwave integrated circuits; Performance evaluation; Program processors;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Extreme Scaling Workshop (XSW), 2013
Conference_Location :
Boulder, CO
Type :
conf
DOI :
10.1109/XSW.2013.12
Filename :
6805043