DocumentCode :
3539864
Title :
CPU-assisted GPGPU on fused CPU-GPU architectures
Author :
Yang, Yi ; Xiang, Ping ; Mantor, Mike ; Zhou, Huiyang
Author_Institution :
Dept. of Electr. & Comput. Eng., North Carolina State Univ., Raleigh, NC, USA
fYear :
2012
fDate :
25-29 Feb. 2012
Firstpage :
1
Lastpage :
12
Abstract :
This paper presents a novel approach that utilizes CPU resources to facilitate the execution of GPGPU programs on fused CPU-GPU architectures. In our model of fused architectures, the GPU and the CPU are integrated on the same die and share the on-chip L3 cache and off-chip memory, similar to the latest Intel Sandy Bridge and AMD accelerated processing unit (APU) platforms. In our proposed CPU-assisted GPGPU, after the CPU launches a GPU program, it executes a pre-execution program, which is generated automatically from the GPU kernel using our proposed compiler algorithms and contains the memory access instructions of the GPU kernel for multiple thread blocks. The CPU pre-execution program runs ahead of the GPU threads because (1) the CPU pre-execution thread contains only the memory fetch instructions from the GPU kernel, not its floating-point computations, and (2) the CPU runs at higher frequencies and exploits higher degrees of instruction-level parallelism than GPU scalar cores. We also leverage the prefetcher at the L2 cache on the CPU side to increase the memory traffic issued from the CPU. As a result, the memory accesses of GPU threads hit in the L3 cache and their latency is drastically reduced. Since our pre-execution is directly controlled by user-level applications, it enjoys both high accuracy and flexibility. Our experiments on a set of benchmarks show that our proposed pre-execution improves performance by up to 113%, and by 21.4% on average.
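To make the pre-execution idea concrete, below is a minimal hand-written sketch in CUDA C, not the authors' compiler-generated code: a simple GPU kernel and a CPU routine that replays only the kernel's memory reads for a window of upcoming thread blocks. The names vec_add, pre_execute_vec_add, blockSize, and sink are illustrative assumptions, and the sketch assumes the buffers are visible to both the CPU and the GPU (e.g., allocated with cudaMallocManaged or residing in the shared memory of a fused CPU-GPU chip), so that the CPU's touches can warm a cache level the GPU later hits in.

// Illustrative sketch only (CUDA C), assuming unified/shared memory so the
// host can touch the same buffers the kernel will read. The paper generates
// the pre-execution program automatically with compiler algorithms; this
// hand-written version merely shows its structure.
#include <cuda_runtime.h>

// Ordinary GPU kernel: one load from a, one from b, one store to c per thread.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// CPU pre-execution: replay only the kernel's memory reads for thread blocks
// [firstBlock, firstBlock + numBlocks), skipping all computation and stores,
// so the touched cache lines are pulled toward the cache shared with the GPU.
// 'sink' is a volatile target that keeps the loads from being optimized away.
static void pre_execute_vec_add(const float *a, const float *b, int n,
                                int blockSize, int firstBlock, int numBlocks,
                                volatile float *sink) {
    for (int blk = firstBlock; blk < firstBlock + numBlocks; ++blk) {
        // Stride by 16 floats (~one 64-byte cache line) instead of visiting
        // every element: touching one word per line is enough to fetch it.
        for (int t = 0; t < blockSize; t += 16) {
            int i = blk * blockSize + t;
            if (i < n)
                *sink = a[i] + b[i];
        }
    }
}

In the paper's scheme, the CPU launches the GPU kernel and then runs such a pre-execution program ahead of the GPU threads; the benefit relies on the fused architecture's shared L3 cache, so this sketch would have no effect on a discrete GPU.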
Keywords :
cache storage; graphics processing units; program compilers; AMD accelerated processing unit platforms; CPU preexecution program; CPU resource utilization; CPU-GPU architecture fusion; CPU-assisted GPGPU; GPGPU program execution; GPU kernel; Intel Sandy Bridge; L2-cache; compiler algorithms; instruction-level parallelism; memory access instructions; memory fetch instructions; multiple thread-blocks; off-chip memory; on-chip L3 cache; prefetcher; user-level applications; Central Processing Unit; Computer architecture; Graphics processing unit; Kernel; Prefetching; System-on-a-chip;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on
Conference_Location :
New Orleans, LA
ISSN :
1530-0897
Print_ISBN :
978-1-4673-0827-4
Electronic_ISBN :
1530-0897
Type :
conf
DOI :
10.1109/HPCA.2012.6168948
Filename :
6168948
Link To Document :
https://doi.org/10.1109/HPCA.2012.6168948