High Performance Design and Implementation of Nemesis Communication Layer for Two-Sided and One-Sided MPI Semantics in MVAPICH2

Author

Luo, Miao ; Potluri, Sreeram ; Lai, Ping ; Mancini, Emilio P. ; Subramoni, Hari ; Kandalla, Krishna ; Sur, Sayantan ; Panda, Dhabaleswar K.

Author_Institution

Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA

fYear

2010

fDate

13-16 Sept. 2010

Firstpage

377

Lastpage

386

Abstract

High End Computing (HEC) systems are being deployed with eight to sixteen compute cores per node, with 64 to 128 cores per node envisioned for exascale systems. MVAPICH2 is a popular implementation of MPI-2 specifically designed and optimized for InfiniBand, iWARP and RDMA over Converged Ethernet (RoCE). MVAPICH2 is based on MPICH2 from ANL. Recently, MPICH2 has been redesigned to optimize intra-node communication for future many-core systems. The new communication layer in MPICH2, called Nemesis, is well optimized for shared-memory message passing and has a modular design that supports various high-performance interconnects. In this paper, we explore the challenges involved in designing the next-generation MVAPICH2 stack, leveraging the Nemesis communication layer. We observe that Nemesis does not provide abstractions for one-sided communication. We propose an extended Nemesis interface for optimized one-sided communication and provide design details. Our experimental evaluation shows that the proposed one-sided interface extensions provide significantly better performance than the basic Nemesis interface. For example, inter-node MPI_Put bandwidth increased from 1,800 MB/s to 3,000 MB/s, and latency for small messages decreased by 13%. Additionally, with our proposed designs, we demonstrate performance gains for small messages compared to the existing MVAPICH2 CH3 implementation. The designs proposed in this paper are a superset of the options currently available to MVAPICH2 users and provide the best combination of performance and modularity.

Keywords

application program interfaces; local area networks; message passing; HEC systems; InfiniBand; RDMA; converged Ethernet; exascale systems; high end computing system; high performance design; high-performance interconnects; iWARP; intranode communication; many-core systems; Nemesis communication layer; next-generation MVAPICH2 stack; one-sided MPI semantics; optimized one-sided communication; shared memory message passing; two-sided MPI semantics; Ethernet networks; Hardware; Open source software; Optimization; Semantics; Sockets; Synchronization; MPICH2; MVAPICH2; RMA

fLanguage

English

Publisher

IEEE

Conference_Titel

Parallel Processing Workshops (ICPPW), 2010 39th International Conference on

Conference_Location

San Diego, CA

ISSN

1530-2016

Print_ISBN

978-1-4244-7918-4

Electronic_ISBN

1530-2016

Type

conf

DOI

10.1109/ICPPW.2010.58

Filename

5599096