• DocumentCode
    576926
  • Title

    Can Network-Offload Based Non-blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?

  • Author

    Kandalla, K. ; Buluç, A. ; Subramoni, H. ; Tomko, K. ; Vienne, J. ; Oliker, L. ; Panda, D.K.

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
  • fYear
    2012
  • fDate
    24-28 Sept. 2012
  • Firstpage
    222
  • Lastpage
    230
  • Abstract
    Graph-based computations are commonly used across data-intensive computing domains ranging from social networks to biological systems. On distributed memory systems, graph algorithms involve explicit communication between processes and often exhibit sparse, irregular behavior. Minimizing these communication overheads is critical to meet the graph-theoretic analysis demands of emerging "big data" applications. In this paper, we explore the challenges associated with reducing the communication overheads of a popular 2D Breadth First Search (BFS) implementation in the CombBLAS library. This BFS algorithm relies on two common MPI collectives, MPI_Alltoallv and MPI_Allgatherv, to exchange data between processes, and these account for more than 20% of the overall run time. Re-designing parallel applications to take advantage of MPI-3 non-blocking collectives to achieve latency hiding is an active area of research. However, the 2D BFS algorithm in CombBLAS is not directly amenable to such a re-design through common overlap techniques such as double-buffering. In this paper, we propose to re-design the BFS algorithm to leverage MPI-3 non-blocking, neighborhood collective communication operations to achieve fine-grained computation/communication overlap. We also leverage the CORE-Direct network offload feature of the ConnectX-2 InfiniBand adapter from Mellanox to design highly efficient and scalable non-blocking, neighborhood Alltoallv and Allgatherv collective operations. Our experimental evaluations show that we can improve the communication overheads of the 2D BFS algorithm by up to 70% with 1,936 processes.
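
    The overlap pattern the abstract describes — post an MPI-3 non-blocking neighborhood collective on a graph communicator, compute locally, then wait — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy topology (each rank as its own sole neighbor) and the function name are placeholders so the sketch runs at any process count; a real 2D BFS would list the ranks of a process's grid row or column as neighbors.

    ```c
    #include <mpi.h>

    /* Sketch of fine-grained overlap with MPI_Ineighbor_alltoallv (MPI-3).
     * Returns the value this rank received, which in this toy topology is
     * its own rank, since each rank is its only neighbor. */
    int neighbor_exchange_sketch(void) {
        MPI_Init(NULL, NULL);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Hypothetical toy topology: each process neighbors only itself.
         * A 2D BFS would instead enumerate its processor row/column. */
        int neighbor[1] = { rank };
        MPI_Comm graph_comm;
        MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                       1, neighbor, MPI_UNWEIGHTED,
                                       1, neighbor, MPI_UNWEIGHTED,
                                       MPI_INFO_NULL, 0, &graph_comm);

        int sendbuf[1] = { rank }, recvbuf[1] = { -1 };
        int counts[1] = { 1 }, displs[1] = { 0 };

        /* Post the collective; it returns immediately. */
        MPI_Request req;
        MPI_Ineighbor_alltoallv(sendbuf, counts, displs, MPI_INT,
                                recvbuf, counts, displs, MPI_INT,
                                graph_comm, &req);

        /* ... local frontier/SpMV computation would overlap here ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);

        MPI_Comm_free(&graph_comm);
        MPI_Finalize();
        return recvbuf[0];
    }
    ```

    The CORE-Direct offload the paper evaluates progresses such collectives on the network adapter, so the window between the post and the wait is where the claimed computation/communication overlap is realized.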
  • Keywords
    application program interfaces; distributed memory systems; graph theory; libraries; message passing; parallel processing; tree searching; 2D BFS algorithm; 2D breadth first search implementation; BFS implementation; CORE-Direct network offload feature; CombBLAS library; ConnectX-2 InfiniBand adapter; MPI Allgatherv; MPI Alltoallv; Mellanox; big data applications; communication overhead minimization; distributed memory systems; double-buffering; graph-based computations; graph-theoretic analysis; data intensive computing domains; irregular graph algorithms; latency hiding; message passing interface; neighborhood collective communication operations; network-offload based nonblocking neighborhood MPI collectives; parallel applications; Algorithm design and analysis; Benchmark testing; Clustering algorithms; Libraries; Sorting; Topology; Vectors; 2D BFS Algorithms; Collective Offload and InfiniBand; MPI-3 non-blocking collectives
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    2012 IEEE International Conference on Cluster Computing Workshops (CLUSTER WORKSHOPS)
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4673-2893-7
  • Type

    conf

  • DOI
    10.1109/ClusterW.2012.40
  • Filename
    6355868