Regional Information Center for Science and Technology - Unlocking bandwidth for GPUs in CC-NUMA systems

DocumentCode :

1949663

Title :

Unlocking bandwidth for GPUs in CC-NUMA systems

Author :

Agarwal, Neha ; Nellans, David ; O'Connor, Mike ; Keckler, Stephen W. ; Wenisch, Thomas F.

fYear :

2015

fDate :

7-11 Feb. 2015

Firstpage :

354

Lastpage :

365

Abstract :

Historically, GPU-based HPC applications have had a substantial memory bandwidth advantage over CPU-based workloads due to using GDDR rather than DDR memory. However, past GPUs required a restricted programming model where application data was allocated up front and explicitly copied into GPU memory before launching a GPU kernel by the programmer. Recently, GPUs have eased this requirement and now can employ on-demand software page migration between CPU and GPU memory to obviate explicit copying. In the near future, CC-NUMA GPU-CPU systems will appear where software page migration is an optional choice and hardware cache-coherence can also support the GPU accessing CPU memory directly. In this work, we describe the trade-offs and considerations in relying on hardware cache-coherence mechanisms versus using software page migration to optimize the performance of memory-intensive GPU workloads. We show that page migration decisions based on page access frequency alone are a poor solution and that a broader solution using virtual address-based program locality to enable aggressive memory prefetching combined with bandwidth balancing is required to maximize performance. We present a software runtime system requiring minimal hardware support that, on average, outperforms CC-NUMA-based accesses by 1.95×, performs 6% better than the legacy CPU to GPU memcpy regime by intelligently using both CPU and GPU memory bandwidth, and comes within 28% of oracular page placement, all while maintaining the relaxed memory semantics of modern GPUs.

Keywords :

cache storage; graphics processing units; parallel processing; storage management; CC-NUMA GPU-CPU systems; CPU memory bandwidth; GDDR memory; GPU kernel; GPU memory bandwidth; GPU relaxed memory semantics; GPU-based HPC applications; aggressive memory prefetching; bandwidth balancing; hardware cache-coherence; memory-intensive GPU workloads; minimal hardware support; on-demand software page migration; oracular page placement; software runtime system; virtual address-based program locality; Bandwidth; Graphics processing units; Hardware; Memory management; Random access memory; Runtime;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on

Conference_Location :

Burlingame, CA

Type :

conf

DOI :

10.1109/HPCA.2015.7056046

Filename :

7056046

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1949663