Title :
Exploring the All-to-All Collective Optimization Space with ConnectX CORE-Direct
Author :
Venkata, Manjunath Gorentla ; Graham, Richard L. ; Ladd, Joshua ; Shamis, Pavel
Author_Institution :
Comput. Sci. & Math. Div., Oak Ridge Nat. Lab., Oak Ridge, TN, USA
Abstract :
The all-to-all collective communication operation is used by many scientific applications, and is one of the most time-consuming and challenging collective operations to optimize. The algorithms for all-to-all operations typically fall into two classes, logarithmic and linear scaling algorithms, with Bruck's algorithm, a logarithmic scaling algorithm, used in many small-data all-to-all implementations. The recent addition of InfiniBand CORE-Direct support for network management of collective communications offers new opportunities for optimizing all-to-all operations as well as supporting truly asynchronous implementations of these operations. This paper presents several new enhancements to the Bruck small-data algorithm that leverage CORE-Direct and other InfiniBand network capabilities to produce efficient implementations of this collective operation. These include RDMA, SR-RNR, and SR-RTR algorithms. Nonblocking implementations of these collective operations are also presented. Benchmark results show that the RDMA algorithm performs the best; it uses CORE-Direct capabilities to offload collective communication management to the Host Channel Adapter (HCA), hardware gather support for sending non-contiguous data, and low-latency RDMA semantics. For a 64-process, 128-byte-per-process all-to-all, the RDMA algorithm performs 27% better than the Bruck's algorithm implementation in Open MPI and 136% better than the SR-RTR algorithm. In addition, the nonblocking versions of these algorithms have the same performance characteristics as the blocking algorithms. Finally, measurements of computation/communication overlap capacity show that all offloaded algorithms achieve about 98% overlap for large-data all-to-all, whereas implementations using host-based progress achieve only about 9.5% overlap.
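For orientation, the logarithmic structure of Bruck's algorithm mentioned in the abstract can be sketched as a single-process simulation. This is a textbook-style illustration only, not the paper's CORE-Direct, RDMA, SR-RNR, or SR-RTR implementations; the function name `bruck_alltoall` and the list-of-lists buffer layout are assumptions made for the example, and no real network or MPI calls are involved.

```python
def bruck_alltoall(send):
    """Simulate Bruck's all-to-all on p virtual processes.

    send[i][j] is the block that process i wants to deliver to process j.
    Returns (recv, rounds), where recv[d][s] is the block that process d
    received from process s, and rounds is the number of exchange steps.
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    p = len(send)

    # Phase 1: local rotation. After it, the block at index j on process i
    # still has to travel j ranks forward (its destination is (i + j) % p).
    buf = [[send[i][(i + j) % p] for j in range(p)] for i in range(p)]

    # Phase 2: ceil(log2(p)) exchange rounds. In the round with distance k,
    # process i forwards every block whose index has bit k set to rank
    # (i + k) % p; the block keeps its index on the receiver.
    k, rounds = 1, 0
    while k < p:
        new_buf = [row[:] for row in buf]
        for i in range(p):
            dst = (i + k) % p
            for j in range(p):
                if j & k:
                    new_buf[dst][j] = buf[i][j]
        buf = new_buf
        k <<= 1
        rounds += 1

    # Phase 3: inverse rotation. The block at index j on process i
    # originated at rank (i - j) % p.
    recv = [[None] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            recv[i][(i - j) % p] = buf[i][j]
    return recv, rounds
```

With 64 processes, as in the benchmark configuration quoted above, this sketch completes in 6 exchange rounds, whereas a pairwise linear exchange would need 63 steps. The price is that each round forwards roughly half of the local blocks, which is why the logarithmic approach is attractive mainly for small messages.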
Keywords :
computer network management; computer network performance evaluation; optimisation; Bruck small-data algorithm; CORE-Direct capabilities; ConnectX CORE-direct; HCA; InfiniBand CORE-Direct support; InfiniBand network capabilities; Open MPI; RDMA algorithm; all-to-all collective communication operation; all-to-all collective optimization space; asynchronous implementations; blocking algorithms; collective communication management; computation-communication overlap capacity; host channel adapter; host-based progress; large data all-to-all; linear scaling algorithms; logarithmic scaling algorithms; low-latency RDMA semantics; network management; nonblocking algorithms; noncontinuous data; offloaded algorithms; performance characteristics; small data all-to-all implementations; Algorithm design and analysis; Benchmark testing; Hardware; Optimization; Receivers; Semantics; Alltoall; Collective Operations; Communication; ConnectX Core-Direct; High Performance Computing; InfiniBand; MPI;
Conference_Title :
Parallel Processing (ICPP), 2012 41st International Conference on
Conference_Location :
Pittsburgh, PA
Print_ISBN :
978-1-4673-2508-0
DOI :
10.1109/ICPP.2012.28