Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

Author

Kandalla, Krishna ; Venkatesh, Akshay ; Hamidouche, Khaled ; Potluri, Sreeram ; Bureddy, D. ; Panda, Dhabaleswar K.

Author_Institution

Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA

fYear

2013

fDate

21-23 Aug. 2013

Firstpage

63

Lastpage

70

Abstract

The emergence of co-processors such as the Intel Many Integrated Core (MIC) is changing the landscape of supercomputing. The MIC is a memory-constrained environment, and its processors operate at slower clock rates. Furthermore, the communication characteristics between MIC processes differ from those between host processes. Communication libraries that do not consider these architectural subtleties cannot deliver good communication performance. The performance of MPI collective operations strongly affects the performance of parallel applications. Owing to the challenges introduced by emerging heterogeneous systems, it is critical to fundamentally redesign collective algorithms to ensure that applications can fully leverage the MIC architecture. In this paper, we propose a generic framework to optimize the performance of important collective operations, such as MPI_Bcast, MPI_Reduce, and MPI_Allreduce, on Intel MIC clusters. We also present a detailed analysis of the compute phases in reduce operations for MIC clusters. To the best of our knowledge, this is the first paper to propose novel designs to improve the performance of collectives on MIC clusters. Our designs improve the latency of the MPI_Bcast operation with 4,864 MPI processes by up to 76%. We also observe improvements of up to 52.4% in the communication latency of the MPI_Allreduce operation with 2K MPI processes on heterogeneous MIC clusters. Our designs also improve the execution time of the WindJammer application by up to 16%.

Keywords

application program interfaces; clocks; coprocessors; message passing; multiprocessing systems; parallel architectures; performance evaluation; Intel MIC InfiniBand clusters; Intel many integrated core InfiniBand clusters; MIC architecture; MIC process; MPI collective operation performance; MPI_Allreduce; MPI_Bcast operation latency; WindJammer application; clock rates; communication characteristics; communication libraries; communication performance; coprocessors; heterogeneous MIC clusters; heterogeneous systems; memory constrained environment; optimized MPI allreduce; optimized MPI broadcast; supercomputing; Algorithm design and analysis; Clustering algorithms; Coprocessors; Libraries; Microwave integrated circuits; Performance evaluation; Slabs;

fLanguage

English

Publisher

IEEE

Conference_Titel

High-Performance Interconnects (HOTI), 2013 IEEE 21st Annual Symposium on

Conference_Location

San Jose, CA

Type

conf

DOI

10.1109/HOTI.2013.26

Filename

6627737