  • DocumentCode
    2405944
  • Title
    Application-bypass reduction for large-scale clusters
  • Author
    Wagner, Adam; Buntinas, Darius; Panda, Dhabaleswar K.; Brightwell, Ron
  • Author_Institution
    Dept. of Comput. & Inf. Sci., The Ohio State Univ., Columbus, OH, USA
  • fYear
    2003
  • fDate
    1-4 Dec. 2003
  • Firstpage
    404
  • Lastpage
    411
  • Abstract
    Process skew is an important factor in the performance of parallel applications, especially in large-scale clusters. Reduction is a common collective operation that, by its nature, introduces implicit synchronization among the participating processes and is therefore highly susceptible to performance degradation from process skew. A collective operation with application-bypass does not require the application to block in order for the operation to make progress. Application-bypass collective operations are therefore highly tolerant of skew. In this paper, we describe the design and implementation of an application-bypass version of the reduction operation in MPICH over GM. We evaluate our implementation on a 16-node cluster. Under conditions of process skew, we find an improvement of up to a factor of 3.3 for our application-bypass reduction over the default MPICH implementation. In addition, this factor of improvement increases with system size, indicating that the application-bypass implementation is more scalable and skew-tolerant than the default non-application-bypass version. This framework shows promise for the design and development of high-performance, scalable collective communication libraries for next-generation large-scale clusters.
  • Keywords
    computer network management; message passing; parallel processing; performance evaluation; workstation clusters; GM; MPICH; application-bypass reduction; collective operation; large-scale clusters; nonapplication-bypass version; parallel applications; performance degradation; process skew; reduction operation; scalable collective communication libraries; skew-tolerant; synchronization; Application software; Computer network management; Computer networks; Concurrent computing; Degradation; Delay; Information science; Laboratories; Large-scale systems; Libraries; Message passing; Parallel processing; Visualization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing, 2003. Proceedings. 2003 IEEE International Conference on
  • Print_ISBN
    0-7695-2066-9
  • Type
    conf
  • DOI
    10.1109/CLUSTR.2003.1253340
  • Filename
    1253340