Title :
Implementing the Himeno benchmark with CUDA on GPU clusters
Author :
Phillips, Everett H. ; Fatica, Massimiliano
Author_Institution :
NVIDIA Corp., Santa Clara, CA, USA
Abstract :
This paper describes the use of CUDA to accelerate the Himeno benchmark on clusters with GPUs. The implementation is designed to optimize memory bandwidth utilization. Our approach achieves over 83% of the theoretical peak bandwidth on an NVIDIA Tesla C1060 GPU and performs at over 50 GFlops. A multi-GPU implementation that uses MPI alongside CUDA streams to overlap GPU execution with data transfers allows linear scaling and performs at over 800 GFlops on a cluster with 16 GPUs. The paper presents the optimizations required to achieve this level of performance.
Keywords :
computer graphic equipment; message passing; microprocessor chips; parallel programming; CUDA cluster; GFlops; GPU cluster; GPU execution; Himeno benchmark; MPI; NVIDIA Tesla C1060 GPU; data transfer; linear scaling; memory bandwidth utilization; multiGPU implementation; Acceleration; Bandwidth; Clocks; Convergence; Design optimization; Frequency; Kernel; Navier-Stokes equations; Poisson equations; Throughput
Conference_Title :
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
Conference_Location :
Atlanta, GA
Print_ISBN :
978-1-4244-6442-5
DOI :
10.1109/IPDPS.2010.5470394