DocumentCode :
2255451
Title :
Implementing a Code Generator for Fast Matrix Multiplication in OpenCL on the GPU
Author :
Matsumoto, Kazuya ; Nakasato, Naohito ; Sedukhin, Stanislav G.
Author_Institution :
Grad. Sch. of Comput. Sci. & Eng., Univ. of Aizu, Aizu-Wakamatsu, Japan
fYear :
2012
fDate :
20-22 Sept. 2012
Firstpage :
198
Lastpage :
204
Abstract :
This paper presents the results of implementing a code generator for fast general matrix multiply (GEMM) kernels. Given a set of parameters, the code generator produces the corresponding GEMM kernel written in OpenCL. The produced kernels are optimized for high-performance execution on AMD GPUs. Access latency to GPU global memory is the main obstacle to high performance. This study shows that storing matrix data in a block-major layout increases the performance and stability of GEMM kernels. On the Tahiti GPU (Radeon HD 7970), our DGEMM (double-precision GEMM) and SGEMM (single-precision GEMM) kernels achieve up to 848 GFlop/s (90% of peak) and 2646 GFlop/s (70% of peak), respectively.
Keywords :
graphics processing units; matrix algebra; program compilers; GPU global memory; OpenCL; Radeon HD 7970; SGEMM; code generator; code generator for fast general matrix multiply; fast matrix multiplication; matrix data; single-precision GEMM; Bandwidth; Generators; Graphics processing units; High definition video; Kernel; Layout; Search engines; GPU; OpenCL; auto-tuning; matrix multiplication;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Embedded Multicore SoCs (MCSoC), 2012 IEEE 6th International Symposium on
Conference_Location :
Aizu-Wakamatsu
Print_ISBN :
978-1-4673-2535-6
Electronic_ISBN :
978-0-7695-4800-5
Type :
conf
DOI :
10.1109/MCSoC.2012.30
Filename :
6354699