Title :
A detailed GPU cache model based on reuse distance theory
Author :
Nugteren, Cedric ; van den Braak, Gert-Jan ; Corporaal, Henk ; Bal, Henri
Author_Institution :
Eindhoven Univ. of Technol., Eindhoven, Netherlands
Abstract :
As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: (1) the GPU's hierarchy of threads, warps, threadblocks, and sets of active threads, (2) conditional and non-uniform latencies, (3) cache associativity, (4) miss-status holding-registers, and (5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate compared to the GPGPU-Sim simulator.
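Illustration (not part of the paper's record): the abstract builds on classic reuse distance theory for sequential processors. The C++ sketch below shows only that sequential baseline, assuming a fully associative LRU cache and a trace already reduced to cache-line indices. It does not model any of the GPU extensions listed above (thread hierarchy, latencies, associativity, miss-status holding-registers, warp divergence). The names `reuse_distances`, `miss_rate`, and `kInfinite` are hypothetical and are not taken from the authors' implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <limits>
#include <list>
#include <vector>

// Sentinel for a first-time access (no earlier use, so a compulsory miss).
constexpr std::size_t kInfinite = std::numeric_limits<std::size_t>::max();

// For each access in the trace, return the number of distinct cache lines
// touched since the previous access to the same line (its reuse distance).
std::vector<std::size_t> reuse_distances(const std::vector<std::uint64_t>& lines) {
    std::list<std::uint64_t> stack;  // LRU stack, most recently used at the front
    std::vector<std::size_t> result;
    result.reserve(lines.size());
    for (std::uint64_t line : lines) {
        auto it = std::find(stack.begin(), stack.end(), line);
        if (it == stack.end()) {
            result.push_back(kInfinite);  // never seen before
        } else {
            // Depth in the LRU stack equals the number of distinct lines
            // accessed since the last use of this line.
            result.push_back(static_cast<std::size_t>(std::distance(stack.begin(), it)));
            stack.erase(it);
        }
        stack.push_front(line);
    }
    return result;
}

// Miss rate of a fully associative LRU cache holding `capacity` lines:
// an access hits exactly when its reuse distance is below the capacity.
double miss_rate(const std::vector<std::size_t>& distances, std::size_t capacity) {
    if (distances.empty()) return 0.0;
    std::size_t misses = 0;
    for (std::size_t d : distances) {
        if (d >= capacity) ++misses;
    }
    return static_cast<double>(misses) / static_cast<double>(distances.size());
}
```

The linear search keeps the sketch short at the cost of O(N·M) time for N accesses and M distinct lines; tree-based stack distance algorithms are the usual choice for long traces. One pass over the trace yields the miss rate for every cache capacity, which is the main appeal of the approach over simulating each configuration separately.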
Keywords :
C++ language; benchmark testing; cache storage; graphics processing units; multi-threading; storage allocation; GPU cache model; Ocelot GPU emulator; Parboil benchmark suites; PolyBench/GPU benchmark suites; active thread hierarchy; cache associativity; cache behaviour prediction; cache configurations; cache locality optimisation; cache miss rates; conditional nonuniform latencies; fine-grained multithreading; mean absolute error; memory address list extraction; miss-status holding-registers; parallel execution model; reuse distance theory; sequential processors; stack distance; thread hierarchy; threadblock hierarchy; warp divergence; warp hierarchy; Computer architecture; Data models; Instruction sets; Kernel; System-on-chip
Conference_Title :
High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on
Conference_Location :
Orlando, FL
DOI :
10.1109/HPCA.2014.6835955