DocumentCode :
125679
Title :
A Portable and High-Performance General Matrix-Multiply (GEMM) Library for GPUs and Single-Chip CPU/GPU Systems
Author :
Garg, Rahul ; Hendren, Laurie
Author_Institution :
Sch. of Comput. Sci., McGill Univ., Montreal, QC, Canada
fYear :
2014
fDate :
12-14 Feb. 2014
Firstpage :
672
Lastpage :
680
Abstract :
OpenCL is a vendor-neutral and portable interface for programming parallel compute devices such as GPUs. Tuning OpenCL implementations of important library functions such as dense general matrix multiply (GEMM) for a particular device is a difficult problem. Further, OpenCL kernels tuned for a particular architecture perform poorly on other architectures. We present a solution to the challenge of writing a portable and high-performance GEMM implementation. We designed and implemented RaijinCL, an OpenCL auto-tuning library for real and complex variants of GEMM that automatically generates tuned kernels for a given architecture. We comprehensively tested our library on a wide variety of architectures and show that the library is competitive with vendor libraries on all tested architectures. We also implemented an autotuner for hybrid CPU+GPU GEMM that takes advantage of both the CPU and GPU on single-chip CPU+GPU platforms such as Intel Ivy Bridge. We show that our solution can outperform CPU-only and GPU-only strategies as well as simple CPU+GPU tuning strategies. In addition to performance results, we provide analysis of architectural limitations as well as OpenCL compiler and runtime issues discovered on various systems, along with guidance on avoiding some of these issues.
Keywords :
graphics processing units; parallel programming; software libraries; OpenCL auto-tuning library; OpenCL compiler; RaijinCL; high performance general matrix-multiply library; hybrid CPU+GPU GEMM; runtime issues; singlechip CPU+GPU platforms; vendor libraries; Computer architecture; Graphics processing units; Kernel; Libraries; Performance evaluation; Registers; Tiles; BLAS; CUDA; GEMM; GPGPU; Ivy Bridge; OpenCL; autotuning; heterogeneous computing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel, Distributed and Network-Based Processing (PDP), 2014 22nd Euromicro International Conference on
Conference_Location :
Torino
ISSN :
1066-6192
Type :
conf
DOI :
10.1109/PDP.2014.40
Filename :
6787346