DocumentCode
1955128
Title
Optimized strategies for mapping three-dimensional FFTs onto CUDA GPUs
Author
Wu, Jing ; Jaja, Joseph
Author_Institution
Dept. of Electr. & Comput. Eng., Univ. of Maryland, College Park, MD, USA
fYear
2012
fDate
13-14 May 2012
Firstpage
1
Lastpage
12
Abstract
We address in this paper the problem of mapping three-dimensional Fast Fourier Transforms (FFTs) onto the recent, highly multithreaded CUDA Graphics Processing Units (GPUs) and present some of the fastest known algorithms for a wide range of 3-D FFTs on the NVIDIA Tesla and Fermi architectures. We exploit the high degree of multi-threading offered by the CUDA environment while carefully managing the multiple levels of the memory hierarchy in such a way that: (i) all global memory accesses are coalesced into 128-byte device memory transactions issued in such a way as to optimize effects related to partition camping [19], locality [22], and associativity; and (ii) all computations are carried out on the registers with effective data movement involved in shared memory transposition. In particular, the number of global memory accesses to the entire 3-D dataset is minimized and the FFT computations along the X dimension are almost completely overlapped with global memory data transfers needed to compute the FFTs along the Y or Z dimensions. We were able to achieve performance between 135 GFlops and 172 GFlops on the Tesla architecture (Tesla C1060 and GTX280) and between 192 GFlops and 290 GFlops on the Fermi architecture (Tesla C2050 and GTX480). The bandwidths achieved by our algorithms reach over 90 GB/s for the GTX280 and around 140 GB/s for the GTX480.
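As an illustration of the dimension-by-dimension structure the abstract refers to (FFTs along X, then Y, then Z), the following is a minimal CPU-side sketch in Python. It is not the authors' CUDA implementation. The function names `fft1d`, `fft3d`, and `dft3d_direct`, the `a[z][y][x]` layout, and the pass order are assumptions made for this example. It shows only that a 3-D DFT separates into 1-D transforms along each axis, and it models none of the GPU memory optimizations described above.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft3d(a):
    """3-D transform of a[z][y][x] as 1-D FFTs along X, then Y, then Z.

    The layout and pass order are assumptions for illustration only.
    """
    nz, ny, nx = len(a), len(a[0]), len(a[0][0])
    # X pass: each row a[z][y] is contiguous in this layout.
    b = [[fft1d(a[z][y]) for y in range(ny)] for z in range(nz)]
    # Y pass: gather a strided column, transform it, scatter it back.
    for z in range(nz):
        for x in range(nx):
            col = fft1d([b[z][y][x] for y in range(ny)])
            for y in range(ny):
                b[z][y][x] = col[y]
    # Z pass: same gather/transform/scatter pattern with a larger stride.
    for y in range(ny):
        for x in range(nx):
            col = fft1d([b[z][y][x] for z in range(nz)])
            for z in range(nz):
                b[z][y][x] = col[z]
    return b


def dft3d_direct(a):
    """Direct triple-sum 3-D DFT, used only to check fft3d on tiny inputs."""
    nz, ny, nx = len(a), len(a[0]), len(a[0][0])
    out = [[[0j] * nx for _ in range(ny)] for _ in range(nz)]
    for kz in range(nz):
        for ky in range(ny):
            for kx in range(nx):
                s = 0j
                for z in range(nz):
                    for y in range(ny):
                        for x in range(nx):
                            phase = kz * z / nz + ky * y / ny + kx * x / nx
                            s += a[z][y][x] * cmath.exp(-2j * cmath.pi * phase)
                out[kz][ky][kx] = s
    return out
```

In this sketch the Y and Z passes read elements that are not adjacent in memory. That strided access pattern is the kind of cost that coalesced transactions and shared memory transposition, as named in the abstract, are meant to address on a GPU.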
Keywords
fast Fourier transforms; graphics processing units; multi-threading; parallel architectures; shared memory systems; CUDA GPU; Fermi architectures; NVIDIA Tesla; Tesla architecture; associativity; global memory access; global memory data transfer; locality; memory hierarchy level; multithreaded CUDA graphics processing units; partition camping; shared memory transposition; strategy optimization; three-dimensional FFT mapping; three-dimensional fast Fourier transform mapping; Arrays; Discrete Fourier transforms; Graphics processing unit; Instruction sets; Kernel; Registers; Fast Fourier Transform; GPU; Multi-threaded Algorithms; Scientific Computing;
fLanguage
English
Publisher
IEEE
Conference_Titel
Innovative Parallel Computing (InPar), 2012
Conference_Location
San Jose, CA
Print_ISBN
978-1-4673-2632-2
Electronic_ISBN
978-1-4673-2631-5
Type
conf
DOI
10.1109/InPar.2012.6339608
Filename
6339608