Abstract:
Although graphics processing units (GPUs) rely on thread-level parallelism to hide long off-chip memory access latency, judicious utilization of on-chip memory resources, including register files, shared memory, and data caches, is critical to application performance. However, explicitly managing GPU on-chip memory resources is a non-trivial task for application developers. More importantly, as on-chip memory resources vary among different GPU generations, performance portability has become a daunting challenge. In this paper, we tackle this problem with compiler-driven automatic data placement. We focus on programs that have already been reasonably optimized, either manually by programmers or automatically by compiler tools. Our proposed compiler algorithms refine these programs by revising data placement across the different types of GPU on-chip resources to achieve both performance enhancement and performance portability. Across the 12 benchmarks in our study, our compiler algorithm improves performance by 1.76x on average on an Nvidia GTX480 and by 1.61x on average on a GTX680.
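To make the data-placement trade-off concrete, the following CUDA sketch shows the same 1D stencil written two ways: one staging reused neighbors in software-managed shared memory, the other keeping operands in registers and relying on the hardware data cache. This is a minimal illustration of the kind of placement choice the paper's compiler automates; the kernels, their names, and the tile size are our own assumptions, not code from the paper.

#define TILE 256  // threads per block; both kernels assume blockDim.x == TILE

// Variant A: reused neighbors staged in shared memory (explicitly managed).
__global__ void stencil_shared(const float* __restrict__ in, float* out, int n)
{
    __shared__ float tile[TILE + 2];                 // one halo element per side
    int gid = blockIdx.x * blockDim.x + threadIdx.x; // global element index
    int lid = threadIdx.x + 1;                       // local index, offset past left halo
    if (gid < n) tile[lid] = in[gid];
    if (threadIdx.x == 0 && gid > 0)
        tile[0] = in[gid - 1];                       // left halo
    if (threadIdx.x == blockDim.x - 1 && gid + 1 < n)
        tile[lid + 1] = in[gid + 1];                 // right halo
    __syncthreads();
    if (gid > 0 && gid + 1 < n)
        out[gid] = 0.25f * tile[lid - 1] + 0.5f * tile[lid] + 0.25f * tile[lid + 1];
}

// Variant B: operands held in registers; the overlapping neighbor loads are
// left to the hardware data cache instead of software-managed shared memory.
__global__ void stencil_cached(const float* __restrict__ in, float* out, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid > 0 && gid + 1 < n) {
        float l = in[gid - 1], c = in[gid], r = in[gid + 1]; // register-resident
        out[gid] = 0.25f * l + 0.5f * c + 0.25f * r;
    }
}

Which variant wins depends on the shared-memory and cache capacities of the target GPU generation (Fermi in the GTX480, Kepler in the GTX680), which is precisely why a fixed manual placement does not port well across generations.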
Keywords:
graphics processing units; performance evaluation; shared memory systems; GPU on-chip memory resources; off-chip memory access; automatic data placement; compiler algorithms; data caches; register files; shared memory; performance enhancement; performance portability; Nvidia GTX480