Optimized strategies for mapping three-dimensional FFTs onto CUDA GPUs

Author

Wu, Jing ; Jaja, Joseph

Author_Institution

Dept. of Electr. & Comput. Eng., Univ. of Maryland, College Park, MD, USA

fYear

2012

fDate

13-14 May 2012

Firstpage

1

Lastpage

12

Abstract

We address in this paper the problem of mapping three-dimensional Fast Fourier Transforms (FFTs) onto recent, highly multithreaded CUDA Graphics Processing Units (GPUs) and present some of the fastest known algorithms for a wide range of 3-D FFTs on the NVIDIA Tesla and Fermi architectures. We exploit the high degree of multithreading offered by the CUDA environment while carefully managing the multiple levels of the memory hierarchy so that: (i) all global memory accesses are coalesced into 128-byte device memory transactions, issued in an order chosen to account for effects related to partition camping [19], locality [22], and associativity; and (ii) all computations are carried out in registers, with efficient data movement through shared memory transposition. In particular, the number of global memory accesses to the entire 3-D dataset is minimized, and the FFT computations along the X dimension are almost completely overlapped with the global memory data transfers needed to compute the FFTs along the Y or Z dimensions. We achieve performance between 135 GFlops and 172 GFlops on the Tesla architecture (Tesla C1060 and GTX280) and between 192 GFlops and 290 GFlops on the Fermi architecture (Tesla C2050 and GTX480). The bandwidths achieved by our algorithms exceed 90 GB/s on the GTX280 and reach around 140 GB/s on the GTX480.
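The abstract relies on the fact that a 3-D transform can be computed as batches of 1-D transforms along X, then Y, then Z. The sketch below illustrates only that decomposition on the CPU; it is not the authors' GPU algorithm. It assumes a `data[z][y][x]` layout, uses a naive O(n^2) DFT in place of a fast 1-D FFT, and the function names `dft_1d` and `dft_3d_separable` are hypothetical.

```python
import cmath


def dft_1d(values):
    """Naive O(n^2) DFT of a sequence of complex numbers.

    Stands in for a fast 1-D FFT so the example has no dependencies.
    """
    n = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft_3d_separable(data):
    """3-D DFT of data[z][y][x], computed as 1-D passes along X, Y, then Z."""
    nz, ny, nx = len(data), len(data[0]), len(data[0][0])
    out = [[list(row) for row in plane] for plane in data]  # work on a copy

    # Pass 1: X dimension. Each row is contiguous in this layout.
    for z in range(nz):
        for y in range(ny):
            out[z][y] = dft_1d(out[z][y])

    # Pass 2: Y dimension. Elements are strided, so gather, transform, scatter.
    for z in range(nz):
        for x in range(nx):
            column = dft_1d([out[z][y][x] for y in range(ny)])
            for y in range(ny):
                out[z][y][x] = column[y]

    # Pass 3: Z dimension. Same gather/transform/scatter with a larger stride.
    for y in range(ny):
        for x in range(nx):
            column = dft_1d([out[z][y][x] for z in range(nz)])
            for z in range(nz):
                out[z][y][x] = column[z]

    return out
```

Only the X pass reads contiguous elements; the Y and Z passes are strided, which is the access pattern that the coalescing and shared memory transposition described in the abstract are meant to handle on a GPU.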

Keywords

fast Fourier transforms; graphics processing units; multi-threading; parallel architectures; shared memory systems; CUDA GPU; Fermi architectures; NVIDIA Tesla; Tesla architecture; associativity; global memory access; global memory data transfer; locality; memory hierarchy level; multithreaded CUDA graphics processing units; partition camping; shared memory transposition; strategy optimization; three-dimensional FFT mapping; three-dimensional fast Fourier transform mapping; Arrays; Discrete Fourier transforms; Graphics processing unit; Instruction sets; Kernel; Registers; Fast Fourier Transform; GPU; Multi-threaded Algorithms; Scientific Computing;

fLanguage

English

Publisher

IEEE

Conference_Titel

Innovative Parallel Computing (InPar), 2012

Conference_Location

San Jose, CA

Print_ISBN

978-1-4673-2632-2

Electronic_ISBN

978-1-4673-2631-5

Type

conf

DOI

10.1109/InPar.2012.6339608

Filename

6339608