Regional Information Center for Science and Technology (RICeST) - Implementing the Himeno benchmark with CUDA on GPU clusters

DocumentCode :

2440181

Title :

Implementing the Himeno benchmark with CUDA on GPU clusters

Author :

Phillips, Everett H. ; Fatica, Massimiliano

Author_Institution :

NVIDIA Corp., Santa Clara, CA, USA

fYear :

2010

fDate :

19-23 April 2010

Firstpage :

Lastpage :

Abstract :

This paper describes the use of CUDA to accelerate the Himeno benchmark on clusters with GPUs. The implementation is designed to optimize memory bandwidth utilization. Our approach achieves over 83% of the theoretical peak bandwidth on an NVIDIA Tesla C1060 GPU and performs at over 50 GFlops. A multi-GPU implementation that utilizes MPI alongside CUDA streams to overlap GPU execution with data transfers allows linear scaling and performs at over 800 GFlops on a cluster with 16 GPUs. The paper presents the optimizations required to achieve this level of performance.
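
The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: it treats the "over 50 GFlops" and "over 800 GFlops" values as exact, although both are lower bounds, and it assumes a peak memory bandwidth of about 102 GB/s for the Tesla C1060, a vendor-listed figure that the abstract does not state. The function names are hypothetical.

```python
# Figures quoted in the abstract. Both throughput values are reported as
# "over N", so they are lower bounds and the results are approximate.
SINGLE_GPU_GFLOPS = 50.0
CLUSTER_GFLOPS = 800.0
NUM_GPUS = 16
BANDWIDTH_FRACTION = 0.83

# Assumption, not from the abstract: vendor-listed peak memory bandwidth
# of the Tesla C1060, in GB/s.
ASSUMED_PEAK_BANDWIDTH_GBS = 102.0


def parallel_efficiency(cluster_gflops, single_gflops, num_gpus):
    """Measured cluster throughput divided by ideal linear scaling."""
    return cluster_gflops / (single_gflops * num_gpus)


def sustained_bandwidth(peak_gbs, fraction):
    """Bandwidth actually achieved, given a fraction of the peak."""
    return peak_gbs * fraction


def bytes_per_flop(bandwidth_gbs, gflops):
    """Memory traffic per floating-point operation (GB/s over GFlop/s)."""
    return bandwidth_gbs / gflops


if __name__ == "__main__":
    eff = parallel_efficiency(CLUSTER_GFLOPS, SINGLE_GPU_GFLOPS, NUM_GPUS)
    bw = sustained_bandwidth(ASSUMED_PEAK_BANDWIDTH_GBS, BANDWIDTH_FRACTION)
    ratio = bytes_per_flop(bw, SINGLE_GPU_GFLOPS)
    print(f"parallel efficiency on {NUM_GPUS} GPUs: {eff:.2f}")
    print(f"sustained bandwidth (assumed peak): {bw:.2f} GB/s")
    print(f"approximate bytes moved per flop: {ratio:.2f}")
```

Under these assumptions, 16 GPUs at 50 GFlops each account for the full 800 GFlops, which is what the abstract means by linear scaling. The bytes-per-flop estimate of roughly 1.7 is consistent with the abstract's emphasis on memory bandwidth rather than arithmetic throughput.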

Keywords :

computer graphic equipment; message passing; microprocessor chips; parallel programming; CUDA cluster; GFlops; GPU cluster; GPU execution; Himeno benchmark; MPI; NVIDIA Tesla C1060 GPU; data transfer; linear scaling; memory bandwidth utilization; multiGPU implementation; parallel programming; Acceleration; Bandwidth; Clocks; Convergence; Design optimization; Frequency; Kernel; Navier-Stokes equations; Poisson equations; Throughput;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on

Conference_Location :

Atlanta, GA

ISSN :

1530-2075

Print_ISBN :

978-1-4244-6442-5

Type :

conf

DOI :

10.1109/IPDPS.2010.5470394

Filename :

5470394

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2440181