Revealing Critical Loads and Hidden Data Locality in GPGPU Applications

Author

Gunjae Koo;Hyeran Jeon;Murali Annavaram

Author_Institution

Ming Hsieh Dept. of Electr. Eng., Univ. of Southern California, Los Angeles, CA, USA

Year

2015

Firstpage

120

Lastpage

129

Abstract

In graphics processing units (GPUs), memory access latency is one of the most critical performance hurdles. Several warp schedulers and memory prefetching algorithms have been proposed to avoid long memory access latency. Prior application characterization studies shed light on the interaction between applications, GPU microarchitecture, and memory subsystem behavior. Most of these studies, however, present only aggregate statistics on how the memory system behaves over the entire application run. In particular, they do not consider how individual load instructions in a program contribute to the aggregate memory system behavior. The analysis presented in this paper shows that there are two distinct classes of load instructions, categorized as deterministic and non-deterministic loads. Using a combination of profiling data from a real GPU card and cycle-accurate simulation data, we show that there is a significant disparity in performance impact when executing these two types of loads. We discuss and suggest several approaches to treat these two load categories differently within the GPU microarchitecture to optimize memory system performance.

Keywords

"Graphics processing units","Instruction sets","Hardware","Registers","Image processing","Kernel","Microarchitecture"

Publisher

IEEE

Conference_Title

Workload Characterization (IISWC), 2015 IEEE International Symposium on

Type

conf

DOI

10.1109/IISWC.2015.23

Filename

7314158