DocumentCode :
3471556
Title :
Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters
Author :
Kandalla, Krishna ; Venkatesh, Akshay ; Hamidouche, Khaled ; Potluri, Sreeram ; Bureddy, D. ; Panda, Dhabaleswar K.
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
fYear :
2013
fDate :
21-23 Aug. 2013
Firstpage :
63
Lastpage :
70
Abstract :
The emergence of co-processors such as Intel Many Integrated Cores (MICs) is changing the landscape of supercomputing. The MIC is a memory-constrained environment, and its processors operate at slower clock rates. Furthermore, the communication characteristics between MIC processes differ from those between host processes. Communication libraries that do not consider these architectural subtleties cannot deliver good communication performance. The performance of MPI collective operations strongly affects the performance of parallel applications. Owing to the challenges introduced by emerging heterogeneous systems, it is critical to fundamentally re-design collective algorithms to ensure that applications can fully leverage the MIC architecture. In this paper, we propose a generic framework to optimize the performance of important collective operations, such as MPI_Bcast, MPI_Reduce, and MPI_Allreduce, on Intel MIC clusters. We also present a detailed analysis of the compute phases in reduce operations for MIC clusters. To the best of our knowledge, this is the first paper to propose novel designs to improve the performance of collectives on MIC clusters. Our designs reduce the latency of the MPI_Bcast operation with 4,864 MPI processes by up to 76%. We also observe improvements of up to 52.4% in the communication latency of the MPI_Allreduce operation with 2K MPI processes on heterogeneous MIC clusters. Our designs also reduce the execution time of the WindJammer application by up to 16%.
Keywords :
application program interfaces; clocks; coprocessors; message passing; multiprocessing systems; parallel architectures; performance evaluation; Intel MIC InfiniBand clusters; Intel many integrated core InfiniBand clusters; MIC architecture; MIC process; MPI collective operation performance; MPI_Allreduce; MPI_Bcast operation latency; WindJammer application; clock rates; communication characteristics; communication libraries; communication performance; coprocessors; heterogeneous MIC clusters; heterogeneous systems; memory constrained environment; optimized MPI allreduce; optimized MPI broadcast; supercomputing; Algorithm design and analysis; Clustering algorithms; Coprocessors; Libraries; Microwave integrated circuits; Performance evaluation; Slabs;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High-Performance Interconnects (HOTI), 2013 IEEE 21st Annual Symposium on
Conference_Location :
San Jose, CA
Type :
conf
DOI :
10.1109/HOTI.2013.26
Filename :
6627737