DocumentCode :
2525359
Title :
Fast Implementation of General Matrix-Vector Multiplication (GEMV) on Kepler GPUs
Author :
Mukunoki, Daichi ; Imamura, Toshiyuki ; Takahashi, Daisuke
Author_Institution :
RIKEN Adv. Inst. for Comput. Sci., Kobe, Japan
fYear :
2015
fDate :
4-6 March 2015
Firstpage :
642
Lastpage :
650
Abstract :
This paper proposes a fast implementation method for the general matrix-vector multiplication (GEMV) routine, one of the level-2 Basic Linear Algebra Subprograms (BLAS), for column-major, non-transposed matrices on NVIDIA Kepler architecture graphics processing units (GPUs). We began by implementing the GEMV kernel using typical blocking techniques for shared memory and registers, along with 128-bit vector load/store instructions. In our initial investigation, we found that even though the kernel could approach the actual peak GPU throughput at some matrix sizes, performance fluctuated periodically with the problem size. We then investigated the cause of the fluctuations using a performance model based on the thread-block scheduling mechanism, and devised a method of determining optimal thread-block sizes that avoids them. As the results show, when run on two Kepler architecture GPUs, our single-precision GEMV (SGEMV) routine achieved better throughput and better performance stability (with respect to the problem size) than existing implementations: CUBLAS 6.5, MAGMA 1.4.1, and KBLAS 1.0. Our implementation techniques apply not only to SGEMV but also to the double-precision (DGEMV), single-precision complex (CGEMV), and double-precision complex (ZGEMV) routines. While this paper primarily discusses the Kepler architecture, we also explore the performance of the proposed implementation on the Maxwell architecture, Kepler's successor.
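The abstract attributes the periodic performance fluctuations to the thread-block scheduling mechanism. A minimal sketch of one such wave-quantization model (illustrative only; the function names, SM counts, and block sizes below are assumptions, not the paper's actual model or parameters): thread blocks are dispatched in "waves" of concurrently resident blocks, and a partially filled final wave leaves SMs idle, so utilization dips whenever the block count just exceeds a multiple of the GPU's residency capacity.

```python
import math

def wave_efficiency(num_rows, rows_per_block, num_sms, blocks_per_sm):
    """Hypothetical wave model: estimate SM utilization for a 1-D
    row-wise thread-block decomposition of a GEMV kernel."""
    num_blocks = math.ceil(num_rows / rows_per_block)
    capacity = num_sms * blocks_per_sm          # blocks resident per wave
    waves = math.ceil(num_blocks / capacity)    # scheduling rounds needed
    return num_blocks / (waves * capacity)      # fraction of slots doing work

def best_rows_per_block(num_rows, num_sms, blocks_per_sm, candidates):
    """Pick the candidate block size with the highest modeled utilization,
    mirroring (in spirit) the paper's idea of choosing thread-block sizes
    that avoid the fluctuation."""
    return max(candidates,
               key=lambda r: wave_efficiency(num_rows, r, num_sms, blocks_per_sm))

# Example with an assumed Kepler-class GPU: 14 SMs, 4 resident blocks per SM.
if __name__ == "__main__":
    for n in (7168, 7424):
        eff = wave_efficiency(n, rows_per_block=128, num_sms=14, blocks_per_sm=4)
        print(n, round(eff, 3))   # 7168 fills its waves exactly; 7424 does not
```

With a fixed 128-row block, n = 7168 yields 56 blocks (one full wave, utilization 1.0), while n = 7424 yields 58 blocks (two waves, utilization ≈ 0.52); switching to 64-row blocks at n = 7424 raises the modeled utilization, illustrating why a size-dependent block choice can smooth the performance curve.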
Keywords :
graphics processing units; matrix algebra; optimisation; parallel architectures; processor scheduling; BLAS subroutines; CUBLAS 6.5; GEMV kernel; KBLAS 1.0; Kepler architecture GPU; MAGMA 1.4.1; Maxwell architecture; NVIDIA Kepler architecture graphics processing units; general matrix-vector multiplication; level-2 basic linear algebra subprograms; shared-memory system; thread-block scheduling mechanism; Computer architecture; Graphics processing units; Instruction sets; Kernel; Matrices; Registers; Throughput; GEMV; GPU; Matrix-vector multiplication; Performance optimization;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
2015 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)
Conference_Location :
Turku
ISSN :
1066-6192
Type :
conf
DOI :
10.1109/PDP.2015.66
Filename :
7092787