Abstract:
For most applications, taking full advantage of the memory system is key to achieving good performance on GPUs. In this paper, we introduce register caching, a novel idea in which the registers of multiple threads are combined and used as a shared, last-level, manually managed cache for the contributing threads. This method is enabled by the shuffle instruction recently introduced in Nvidia's Kepler GPU architecture, which allows threads in the same warp to exchange data directly, something previously possible only by going through shared memory. We evaluate our proposal with a stencil computation benchmark, achieving speedups of up to 2.04 compared to using shared memory on a GTX680 GPU. Stencil computations form the core of many scientific applications, which can therefore benefit from our proposal. Furthermore, our method is not limited to stencil computations, but is applicable to any application with a predictable memory access pattern suitable for manual caching.
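To make the register-caching idea concrete, the sketch below shows how warp-shuffle intrinsics let lanes read a neighbor's register in a 1D three-point stencil. The kernel name, coefficients, and overall shape are illustrative assumptions, not the authors' code; on Kepler-era CUDA the intrinsics were `__shfl_up`/`__shfl_down`, while CUDA 9 and later require the `_sync` variants shown here.

```cuda
// Minimal sketch of register caching via warp shuffle (illustrative, not
// the paper's implementation). Each lane keeps one input element in a
// register; neighboring lanes read it directly instead of staging the
// tile in shared memory.
__global__ void stencil3_shfl(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    // Read the register of the lane one below / one above in the warp.
    // 0xffffffff: all 32 lanes participate (CUDA 9+ _sync form).
    float left  = __shfl_up_sync(0xffffffff, v, 1);
    float right = __shfl_down_sync(0xffffffff, v, 1);

    // Lanes 0 and 31 receive their own value back from the shuffle, so a
    // complete version would need a halo fallback at warp boundaries
    // (e.g. a global-memory load), omitted here for brevity.
    if (i > 0 && i < n - 1)
        out[i] = 0.25f * left + 0.5f * v + 0.25f * right;
}
```

The key point is that `left` and `right` never touch shared or global memory: the warp's registers collectively act as the manually managed cache the abstract describes.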
Keywords:
cache storage; graphics processing units; shared memory systems; shuffle instruction; register caching; stencil computations; data exchange; Nvidia Kepler GPU architecture; GTX680 GPU; benchmark testing; computer architecture; instruction sets; registers; CUDA; caching; GPU computing