DocumentCode :
3385620
Title :
To GPU synchronize or not GPU synchronize?
Author :
Feng, Wu-chun ; Xiao, Shucai
Author_Institution :
Dept. of Comput. Sci., Virginia Tech, Blacksburg, VA, USA
fYear :
2010
fDate :
May 30 2010-June 2 2010
Firstpage :
3801
Lastpage :
3804
Abstract :
The graphics processing unit (GPU) has evolved from being a fixed-function processor with programmable stages into a programmable processor with many fixed-function components that deliver massive parallelism. By modifying the GPU's stream processor to support “general-purpose computation” on the GPU (GPGPU), applications that perform massive vector operations can realize many orders-of-magnitude improvement in performance over a traditional processor, i.e., the CPU. However, the breadth of general-purpose computation that can be efficiently supported on a GPU has largely been limited to highly data-parallel or task-parallel applications due to the lack of explicit support for communication between streaming multiprocessors (SMs) on the GPU. Such communication can occur via the global memory of a GPU, but it then requires a barrier synchronization across the SMs of the GPU in order to complete the communication between SMs. Although our previous work demonstrated that implementing barrier synchronization on the GPU itself can significantly improve performance and deliver correct results in critical bioinformatics applications, guaranteeing the correctness of inter-SM communication is only possible if a memory consistency model is assumed. To address this problem, NVIDIA recently introduced the __threadfence() function in CUDA 2.2, a function that can guarantee the correctness of GPU-based inter-SM communication. However, this function currently introduces so much overhead that, when it is used in (direct) GPU synchronization, GPU synchronization actually performs worse than indirect synchronization via the CPU, thus raising the question of whether “to GPU synchronize or not GPU synchronize?”
Keywords :
computer graphic equipment; coprocessors; parallel architectures; synchronisation; vector processor systems; CUDA; GPU; NVIDIA; barrier synchronization; fixed function processor; general purpose computation; graphic processing unit; programmable processor; streaming multiprocessor; task parallel application; vector operation; Application software; Bioinformatics; Central Processing Unit; Computer architecture; Computer science; Graphical processing unit; Graphics processing unit; Interpolation; Parallel processing; Rendering (computer graphics);
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on
Conference_Location :
Paris
Print_ISBN :
978-1-4244-5308-5
Electronic_ISBN :
978-1-4244-5309-2
Type :
conf
DOI :
10.1109/ISCAS.2010.5537722
Filename :
5537722