Dymaxion: Optimizing memory access patterns for heterogeneous systems

Author

Che, Shuai ; Sheaffer, Jeremy W. ; Skadron, Kevin

Author_Institution

Dept. of Comput. Sci., Univ. of Virginia, Charlottesville, VA, USA

fYear

2011

fDate

12-18 Nov. 2011

Firstpage

1

Lastpage

11

Abstract

Graphics processors (GPUs) have emerged as an important platform for general-purpose computing. GPUs offer a large number of parallel cores and have access to high memory bandwidth; however, data structure layouts in GPU memory often lead to suboptimal performance for programs designed with a CPU memory interface (or no particular memory interface at all!) in mind. This implies that application performance is highly sensitive to irregularity in memory access patterns. This issue is all the more important due to the growing disparity between core and DRAM clocks; memory interfaces have increasingly become bottlenecks in computer systems. In this paper, we propose a simple API, Dymaxion, that allows programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms. Use of Dymaxion requires only minimal modifications to existing CUDA programs. Our current framework extends NVIDIA's CUDA API with the addition of memory layout remapping and index transformation. We consider the overhead of layout remapping and effectively hide it through chunking and overlapping with PCI-E transfer. We present the implementation of Dymaxion and its optimizations and evaluate a variety of important memory access patterns. Using four case studies, we are able to achieve 3.3x speedup on GPU kernels and 20% overall performance improvement, including the PCI-E transfer, over the original CUDA implementations on an NVIDIA GTX 480 GPU. We also explore the importance of maintaining per-device data layouts and cross-device data mappings with a case study of concurrent CPU-GPU execution.
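The two mechanisms named in the abstract, memory layout remapping and index transformation, can be illustrated with a small host-side sketch. This is not the Dymaxion API: the function names `remap_row_to_column_major` and `transform_index` are hypothetical, and a row-major to column-major change is assumed only as one example of a remapping. The idea is that data is physically reordered once, and each logical access is then translated into an offset in the new layout, so that accesses which were strided in the original layout become contiguous.

```python
# Illustrative sketch only: hypothetical helper names, not the Dymaxion API.

def remap_row_to_column_major(data, rows, cols):
    """Copy a flat row-major array into a new flat column-major array."""
    remapped = [None] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            # Element (r, c) moves from offset r*cols + c to offset c*rows + r.
            remapped[c * rows + r] = data[r * cols + c]
    return remapped


def transform_index(r, c, rows):
    """Translate a logical (row, col) access into an offset in the remapped layout."""
    return c * rows + r


rows, cols = 2, 3
original = [0, 1, 2, 3, 4, 5]  # logical matrix [[0, 1, 2], [3, 4, 5]]
remapped = remap_row_to_column_major(original, rows, cols)

# Walking down column 1 now touches adjacent offsets in the remapped array.
column_offsets = [transform_index(r, 1, rows) for r in range(rows)]
```

In this sketch, code that reads element (r, c) keeps its logical indices and only the offset computation changes, which is one way a remapping can be introduced with small changes to existing code.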

Keywords

DRAM chips; application program interfaces; concurrency control; coprocessors; data structures; parallel architectures; API; CPU memory interface; CUDA program; DRAM clocks; Dymaxion; GPU memory; NVIDIA GTX 480 GPU; PCI-E transfer; concurrent CPU-GPU execution; cross-device data mappings; data structure layouts; general purpose computing; graphics processor; heterogeneous system; index transformation; memory access pattern optimisation; memory access patterns; memory layout remapping; parallel cores; per-device data layouts; Arrays; Graphics processing unit; Indexes; Instruction sets; Kernel; Layout; GPGPU; Heterogeneous Computer Architectures; Latency Hiding; Memory Access and Data Layout;

fLanguage

English

Publisher

ieee

Conference_Titel

High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for

Conference_Location

Seattle, WA

Electronic_ISBN

978-1-4503-0771-0

Type

conf

Filename

6114426