DocumentCode :
2405944
Title :
Application-bypass reduction for large-scale clusters
Author :
Wagner, Adam ; Buntinas, Darius ; Panda, Dhabaleswar K. ; Brightwell, Ron
Author_Institution :
Dept. of Comput. & Inf. Sci., The Ohio State Univ., Columbus, OH, USA
fYear :
2003
fDate :
1-4 Dec. 2003
Firstpage :
404
Lastpage :
411
Abstract :
Process skew is an important factor in the performance of parallel applications, especially in large-scale clusters. Reduction is a common collective operation which, by its nature, introduces implicit synchronization between the processes involved in the communication and is therefore highly susceptible to performance degradation due to process skew. A collective operation with application-bypass does not require the application to block in order for the operation to make progress. Application-bypass collective operations are therefore highly tolerant of skew. In this paper we describe the design and implementation of an application-bypass version of the reduction operation in MPICH over GM. We evaluate our implementation on a 16-node cluster. Under conditions of process skew we find a factor of improvement of up to 3.3 for our application-bypass reduction versus the default MPICH implementation. In addition, we see that this factor of improvement increases with system size, indicating that the application-bypass implementation is more scalable and skew-tolerant than the default non-application-bypass version. This framework promises design and development of high-performance and scalable collective communication libraries for next-generation large-scale clusters.
Keywords :
computer network management; message passing; parallel processing; performance evaluation; workstation clusters; GM; MPICH; application-bypass reduction; collective operation; large-scale clusters; nonapplication-bypass version; parallel applications; performance degradation; process skew; reduction operation; scalable collective communication libraries; skew-tolerant; synchronization; Application software; Computer network management; Computer networks; Concurrent computing; Degradation; Delay; Information science; Laboratories; Large-scale systems; Libraries; Message passing; Parallel processing; Visualization;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Cluster Computing, 2003. Proceedings. 2003 IEEE International Conference on
Print_ISBN :
0-7695-2066-9
Type :
conf
DOI :
10.1109/CLUSTR.2003.1253340
Filename :
1253340