Regional Information Center for Science and Technology (RICeST) - Efficient sparse matrix-vector multiplication on cache-based GPUs

DocumentCode :

1955024

Title :

Efficient sparse matrix-vector multiplication on cache-based GPUs

Author :

Reguly, Istvan ; Giles, Mike

Author_Institution :

Fac. of Inf. Technol., Pazmany Peter Catholic Univ., Budapest, Hungary

fYear :

2012

fDate :

13-14 May 2012

Firstpage :

Lastpage :

Abstract :

Sparse matrix-vector multiplication is an integral part of many scientific algorithms. Several studies have shown that it is a bandwidth-limited operation on current hardware. On cache-based architectures the main factors that influence performance are spatial locality in accessing the matrix, and temporal locality in re-using the elements of the vector. This paper discusses efficient implementations of sparse matrix-vector multiplication on NVIDIA's Fermi architecture, the first to introduce conventional L1 caches to GPUs. We focus on the compressed sparse row (CSR) format for developing general purpose code. We present a parametrised algorithm, show the effects of parameter tuning on performance and introduce a method for determining the near-optimal set of parameters that incurs virtually no overhead. On a set of sparse matrices from the University of Florida Sparse Matrix Collection we show an average speed-up of 2.1 times over NVIDIA's CUSPARSE 4.0 library in single precision and 1.4 times in double precision. Many algorithms require repeated evaluation of sparse matrix-vector products with the same matrix, so we introduce a dynamic run-time auto-tuning system which improves performance by 10-15% in seven iterations. The CSR format is compared to alternative ELLPACK and HYB formats and the cost of conversion is assessed using CUSPARSE. Sparse matrix-vector multiplication performance is also analysed when solving a finite element problem with the conjugate gradient method. We show how problem-specific knowledge can be used to improve performance by up to a factor of two.
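
For readers unfamiliar with the CSR format named in the abstract, the following is a minimal serial sketch of a CSR matrix-vector product in plain Python. It is not the paper's parametrised GPU algorithm; the function name `csr_spmv`, the array names `row_ptr`, `col_idx` and `values`, and the 3x3 example matrix are all assumptions chosen for illustration. It only shows the two access patterns the abstract refers to: contiguous reads of the matrix arrays (spatial locality) and indirect, repeated reads of the vector (temporal locality).

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the non-zeros of row i;
    col_idx[k] and values[k] give the column and value of non-zero k.
    (Illustrative names, not taken from the paper.)
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Matrix data is read contiguously: spatial locality.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Vector entries are read indirectly and may be re-used
            # by other rows: temporal locality.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Hypothetical example matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]

y = csr_spmv(row_ptr, col_idx, values, x)
```

On a GPU, the outer loop over rows would typically be distributed across threads, and how many threads cooperate on one row is the kind of choice a tunable implementation could expose; the abstract does not list the paper's actual parameters.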

Keywords :

cache storage; conjugate gradient methods; finite element analysis; graphics processing units; mathematics computing; matrix multiplication; sparse matrices; vectors; CSR format; CUSPARSE; ELLPACK formats; HYB formats; L1 caches; NVIDIA CUSPARSE 4.0 library; NVIDIA Fermi architecture; University of Florida sparse matrix collection; bandwidth-limited operation; cache-based GPU; cache-based architectures; compressed sparse row format; conjugate gradient method; dynamic run-time autotuning system; finite element problem; general purpose code; sparse matrix-vector multiplication; spatial locality; temporal locality; Algorithm design and analysis; Graphics processing unit; Heuristic algorithms; Instruction sets; Memory management; Sparse matrices; Vectors; autotuning; cache performance; conjugate gradient method; finite element method; sparse matrix-vector multiplication;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Innovative Parallel Computing (InPar), 2012

Conference_Location :

San Jose, CA

Print_ISBN :

978-1-4673-2632-2

Electronic_ISBN :

978-1-4673-2631-5

Type :

conf

DOI :

10.1109/InPar.2012.6339602

Filename :

6339602

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1955024