• DocumentCode
    560160
  • Title

    Dymaxion: Optimizing memory access patterns for heterogeneous systems

  • Author

    Che, Shuai ; Sheaffer, Jeremy W. ; Skadron, Kevin

  • Author_Institution
    Dept. of Comput. Sci., Univ. of Virginia, Charlottesville, VA, USA
  • fYear
    2011
  • fDate
    12-18 Nov. 2011
  • Firstpage
    1
  • Lastpage
    11
  • Abstract
    Graphics processors (GPUs) have emerged as an important platform for general purpose computing. GPUs offer a large number of parallel cores and have access to high memory bandwidth; however, data structure layouts in GPU memory often lead to suboptimal performance for programs designed with a CPU memory interface (or no particular memory interface at all!) in mind. This implies that application performance is highly sensitive to irregularity in memory access patterns. This issue is all the more important due to the growing disparity between core and DRAM clocks; memory interfaces have increasingly become bottlenecks in computer systems. In this paper, we propose a simple API, Dymaxion, that allows programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms. Use of Dymaxion requires only minimal modifications to existing CUDA programs. Our current framework extends NVIDIA's CUDA API with the addition of memory layout remapping and index transformation. We consider the overhead of layout remapping and effectively hide it through chunking and overlapping with PCI-E transfer. We present the implementation of Dymaxion and its optimizations and evaluate a variety of important memory access patterns. Using four case studies, we are able to achieve 3.3x speedup on GPU kernels and 20% overall performance improvement, including the PCI-E transfer, over the original CUDA implementations on an NVIDIA GTX 480 GPU. We also explore the importance of maintaining per-device data layouts and cross-device data mappings with a case study of concurrent CPU-GPU execution.
  • Keywords
    DRAM chips; application program interfaces; concurrency control; coprocessors; data structures; parallel architectures; API; CPU memory interface; CUDA program; DRAM clocks; Dymaxion; GPU memory; NVIDIA GTX 480 GPU; PCI-E transfer; concurrent CPU-GPU execution; cross-device data mappings; data structure layouts; general purpose computing; graphics processor; heterogeneous system; index transformation; memory access pattern optimisation; memory access patterns; memory layout remapping; parallel cores; per-device data layouts; Arrays; Graphics processing unit; Indexes; Instruction sets; Kernel; Layout; GPGPU; Heterogeneous Computer Architectures; Latency Hiding; Memory Access and Data Layout;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for
  • Conference_Location
    Seattle, WA
  • Electronic_ISBN
    978-1-4503-0771-0
  • Type

    conf

  • Filename
    6114426