• DocumentCode
    2766819
  • Title

    ConnectX-2 InfiniBand Management Queues: First Investigation of the New Support for Network Offloaded Collective Operations

  • Author

    Graham, Richard L. ; Poole, Steve ; Shamis, Pavel ; Bloch, Gabriel ; Bloch, Gil ; Chapman, Hillel ; Kagan, Michael ; Shahar, A. ; Rabinovitz, Ishai ; Shainer, G.

  • Author_Institution
    Oak Ridge Nat. Lab. (ORNL), Oak Ridge, TN, USA
  • fYear
    2010
  • fDate
    17-20 May 2010
  • Firstpage
    53
  • Lastpage
    62
  • Abstract
    This paper introduces the newly developed InfiniBand (IB) Management Queue capability, used by the Host Channel Adapter (HCA) to manage network task data flow dependencies, and progress the communications associated with such flows. These tasks include sends, receives, and the newly supported wait task, and are scheduled by the HCA based on a data dependency description provided by the user. This functionality is supported by the ConnectX-2 HCA, and provides the means for delegating collective communication management and progress to the HCA, also known as collective communication offload. This provides a means for overlapping collective communications managed by the HCA and computation on the Central Processing Unit (CPU), thus making it possible to reduce the impact of system noise on parallel applications using collective operations. This paper further describes how this new capability can be used to implement scalable Message Passing Interface (MPI) collective operations, describing the high level details of how this new capability is used to implement the MPI Barrier collective operation, focusing on the latency sensitive performance aspects of this new capability. This paper concludes with small scale benchmark experiments comparing implementations of the barrier collective operation, using the new network offload capabilities, with established point-to-point based implementations of these same algorithms, which manage the data flow using the central processing unit. These early results demonstrate the promise this new capability provides to improve the scalability of high-performance applications using collective communications. The latency of the HCA based implementation of the barrier is similar to that of the best performing point-to-point based implementation managed by the central processing unit, starting to outperform these as the number of processes involved in the collective operation increases.
  • Keywords
    Broadcasting; Central Processing Unit; Clouds; Computer architecture; Delay; Fabrics; Grid computing; Hardware; Open source software; Scalability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on
  • Conference_Location
    Melbourne, Australia
  • Print_ISBN
    978-1-4244-6987-1
  • Type

    conf

  • DOI
    10.1109/CCGRID.2010.9
  • Filename
    5493494