DocumentCode :
167301
Title :
Optimizing Collective Communication in UPC
Author :
Jose, Jithin ; Hamidouche, Khaled ; Zhang, Jie ; Venkatesh, Akshay ; Panda, Dhabaleswar K.
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
fYear :
2014
fDate :
19-23 May 2014
Firstpage :
361
Lastpage :
370
Abstract :
The Message Passing Interface (MPI) has been the de facto programming model for scientific parallel applications. However, data-driven applications with irregular communication patterns are harder to implement using MPI. Partitioned Global Address Space (PGAS) programming models present an alternative approach that improves programmability. PGAS languages such as UPC are growing in popularity because of their ability to provide a shared-memory programming model over distributed-memory machines. However, since UPC is an emerging standard, it is unlikely that entire applications will be re-written in it. Instead, unified communication runtimes have paved the way for a new class of hybrid applications that can leverage the benefits of both the MPI and PGAS models. Such unified runtimes need to be designed in a high-performance, scalable manner to improve the performance of emerging hybrid applications. Collective communication primitives offer a flexible, portable way to implement group communication operations and are supported in both the MPI and PGAS programming models. Owing to these advantages, they are widely used across scientific parallel applications. Over the years, MPI libraries have relied on aggressive software-/hardware-based and kernel-assisted optimizations to deliver low communication latency for various collective operations. However, there is much room for improvement for collective operations in state-of-the-art, open-source implementations of UPC. In this paper, we address the challenges associated with improving the performance of collective primitives in UPC. Further, we explore design alternatives that enable collective primitives in UPC to directly leverage the designs available in the MVAPICH2 MPI library. Our experimental evaluations show that our designs improve the performance of the UPC broadcast and all-gather operations by 25X and 18X, respectively, for a 128KB message at 2,048 processes. Our designs also improve the performance of the UPC 2D-Heat kernel by up to 2X at 2,048 processes, and of the NAS-FT benchmark by 12% at 256 processes.
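Illustrative_Example :
A minimal UPC sketch of the two collectives the abstract evaluates, using the standard upc_all_broadcast and upc_all_gather_all primitives from <upc_collective.h>. The paper's designs optimize the runtime beneath these same calls; the array names and the NELEMS value below are illustrative assumptions, not taken from the paper.

    #include <upc.h>
    #include <upc_collective.h>

    #define NELEMS 1024  /* elements per thread block; assumed for illustration */

    /* Broadcast: the source block A lives entirely on thread 0;
       each thread receives one NELEMS-sized block of B. */
    shared [] int A[NELEMS];
    shared [NELEMS] int B[NELEMS * THREADS];

    /* All-gather: each thread contributes its NELEMS-sized block of C
       and receives the full concatenation in its row of D. */
    shared [NELEMS] int C[NELEMS * THREADS];
    shared [NELEMS * THREADS] int D[THREADS][NELEMS * THREADS];

    int main(void) {
        /* Barrier on entry and exit of each collective. */
        upc_all_broadcast(B, A, sizeof(int) * NELEMS,
                          UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

        upc_all_gather_all(D, C, sizeof(int) * NELEMS,
                           UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
        return 0;
    }

In open-source UPC runtimes these calls are typically implemented over point-to-point operations; the paper's approach instead routes them to the tuned collective engines of the MVAPICH2 MPI library.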
Keywords :
application program interfaces; message passing; parallel programming; MPI programming model; MVAPICH2 MPI library; NAS-FT benchmark; PGAS programming model; UPC language; communication latency; hardware-based optimization; kernel-assisted optimization; message passing interface; partitioned global address space; scientific parallel application; software-based optimization; Algorithm design and analysis; Benchmark testing; Electronics packaging; Kernel; Libraries; Programming; Runtime; UPC; Collectives; InfiniBand; Programming Models; PGAS;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International
Conference_Location :
Phoenix, AZ
Print_ISBN :
978-1-4799-4117-9
Type :
conf
DOI :
10.1109/IPDPSW.2014.49
Filename :
6969411