Regional Information Center for Science and Technology - Exploring the All-to-All Collective Optimization Space with ConnectX CORE-Direct

DocumentCode :

1920716

Title :

Exploring the All-to-All Collective Optimization Space with ConnectX CORE-Direct

Author :

Venkata, Manjunath Gorentla ; Graham, Richard L. ; Ladd, Joshua ; Shamis, Pavel

Author_Institution :

Comput. Sci. & Math. Div., Oak Ridge Nat. Lab., Oak Ridge, TN, USA

fYear :

2012

fDate :

10-13 Sept. 2012

Firstpage :

289

Lastpage :

298

Abstract :

The all-to-all collective communication operation is used by many scientific applications, and is one of the most time-consuming and challenging collective operations to optimize. The algorithms for all-to-all operations typically fall into two classes, logarithmic and linear scaling algorithms, with Bruck's algorithm, a logarithmic scaling algorithm, used in many small-data all-to-all implementations. The recent addition of InfiniBand CORE-Direct support for network management of collective communications offers new opportunities for optimizing the all-to-all operation as well as supporting truly asynchronous implementations of these operations. This paper presents several new enhancements to the Bruck small-data algorithm that leverage CORE-Direct and other InfiniBand network capabilities to produce efficient implementations of this collective operation. These include the RDMA, SR-RNR, and SR-RTR algorithms. In addition, nonblocking implementations of these collective operations are also presented. Benchmark results show that the RDMA algorithm performs the best; it uses CORE-Direct capabilities to offload collective communication management to the Host Channel Adapter (HCA), hardware gather support for sending non-continuous data, and low-latency RDMA semantics. For a 64-process, 128-byte-per-process all-to-all, the RDMA algorithm performs 27% better than the implementation of Bruck's algorithm in Open MPI and 136% better than the SR-RTR algorithm. In addition, the nonblocking versions of these algorithms have the same performance characteristics as the blocking algorithms. Finally, measurements of computation/communication overlap capacity show that all offloaded algorithms achieve about 98% overlap for large-data all-to-all, whereas implementations using host-based progress achieve only about 9.5% overlap.
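
To make the "logarithmic scaling" of Bruck's algorithm concrete, the exchange pattern can be sketched as a small single-process simulation. This is only an illustrative sketch of the textbook Bruck all-to-all, not the paper's CORE-Direct, RDMA, SR-RNR, or SR-RTR implementations; the function name `bruck_alltoall` and the list-of-lists buffer layout are assumptions made for this example.

```python
def bruck_alltoall(send):
    """Simulate Bruck's all-to-all on P logical processes.

    send[p][d] is the block that process p wants delivered to process d.
    Returns (recv, rounds), where recv[p][s] is the block process p
    received from process s, and rounds is the number of exchange steps.
    """
    P = len(send)

    # Phase 1: local rotation. Slot i of process p now holds the block
    # destined for process (p + i) % P.
    buf = [[send[p][(p + i) % P] for i in range(P)] for p in range(P)]

    # Phase 2: ceil(log2(P)) exchange rounds. In the round with distance k,
    # process p sends every slot whose index has bit k set to process
    # (p + k) % P; the receiver stores it in the same slot.
    rounds = 0
    k = 1
    while k < P:
        nxt = [row[:] for row in buf]  # snapshot: all sends use pre-round data
        for p in range(P):
            dst = (p + k) % P
            for i in range(P):
                if i & k:
                    nxt[dst][i] = buf[p][i]
        buf = nxt
        rounds += 1
        k <<= 1

    # Phase 3: local inverse rotation. Slot i of process p now holds the
    # block that originated at process (p - i) % P.
    recv = [[None] * P for _ in range(P)]
    for p in range(P):
        for i in range(P):
            recv[p][(p - i) % P] = buf[p][i]
    return recv, rounds


if __name__ == "__main__":
    P = 64  # process count used in the abstract's benchmark example
    send = [[(src, dst) for dst in range(P)] for src in range(P)]
    recv, rounds = bruck_alltoall(send)
    print(rounds)  # number of exchange rounds for P = 64
```

In this sketch, 64 processes complete the exchange in 6 rounds, whereas a linear pairwise exchange would need 63. Each round sends a non-adjacent set of slots, which is the kind of access pattern where hardware gather support, as mentioned in the abstract, can help.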

Keywords :

computer network management; computer network performance evaluation; optimisation; Bruck small-data algorithm; CORE-Direct capabilities; ConnectX CORE-direct; HCA; InfiniBand CORE-Direct support; InfiniBand network capabilities; Open MPI; RDMA algorithm; all-to-all collective communication operation; all-to-all collective optimization space; asynchronous implementations; blocking algorithms; collective communication management; computation-communication overlap capacity; host channel adapter; host-based progress; large data all-to-all; linear scaling algorithms; logarithmic scaling algorithms; low-latency RDMA semantics; network management; nonblocking algorithms; noncontinuous data; offloaded algorithms; performance characteristics; small data all-to-all implementations; Algorithm design and analysis; Benchmark testing; Hardware; Optimization; Receivers; Semantics; Alltoall; Collective Operations; Communication; ConnectX Core-Direct; High Performance Computing; InfiniBand; MPI;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Parallel Processing (ICPP), 2012 41st International Conference on

Conference_Location :

Pittsburgh, PA

ISSN :

0190-3918

Print_ISBN :

978-1-4673-2508-0

Type :

conf

DOI :

10.1109/ICPP.2012.28

Filename :

6337590

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1920716