DocumentCode :
2440181
Title :
Implementing the Himeno benchmark with CUDA on GPU clusters
Author :
Phillips, Everett H. ; Fatica, Massimiliano
Author_Institution :
NVIDIA Corp., Santa Clara, CA, USA
fYear :
2010
fDate :
19-23 April 2010
Firstpage :
1
Lastpage :
10
Abstract :
This paper describes the use of CUDA to accelerate the Himeno benchmark on clusters with GPUs. The implementation is designed to optimize memory bandwidth utilization. Our approach achieves over 83% of the theoretical peak bandwidth on an NVIDIA Tesla C1060 GPU and performs at over 50 GFlops. A multi-GPU implementation that utilizes MPI alongside CUDA streams to overlap GPU execution with data transfers allows linear scaling and performs at over 800 GFlops on a cluster with 16 GPUs. The paper presents the optimizations required to achieve this level of performance.
Keywords :
computer graphic equipment; message passing; microprocessor chips; parallel programming; CUDA cluster; GFlops; GPU cluster; GPU execution; Himeno benchmark; MPI; NVIDIA Tesla C1060 GPU; data transfer; linear scaling; memory bandwidth utilization; multiGPU implementation; Acceleration; Bandwidth; Clocks; Convergence; Design optimization; Frequency; Kernel; Navier-Stokes equations; Poisson equations; Throughput
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on
Conference_Location :
Atlanta, GA
ISSN :
1530-2075
Print_ISBN :
978-1-4244-6442-5
Type :
conf
DOI :
10.1109/IPDPS.2010.5470394
Filename :
5470394