DocumentCode :
1955024
Title :
Efficient sparse matrix-vector multiplication on cache-based GPUs
Author :
Reguly, Istvan; Giles, Mike
Author_Institution :
Fac. of Inf. Technol., Pazmany Peter Catholic Univ., Budapest, Hungary
fYear :
2012
fDate :
13-14 May 2012
Firstpage :
1
Lastpage :
12
Abstract :
Sparse matrix-vector multiplication is an integral part of many scientific algorithms. Several studies have shown that it is a bandwidth-limited operation on current hardware. On cache-based architectures the main factors that influence performance are spatial locality in accessing the matrix, and temporal locality in re-using the elements of the vector. This paper discusses efficient implementations of sparse matrix-vector multiplication on NVIDIA's Fermi architecture, the first to introduce conventional L1 caches to GPUs. We focus on the compressed sparse row (CSR) format for developing general purpose code. We present a parametrised algorithm, show the effects of parameter tuning on performance and introduce a method for determining the near-optimal set of parameters that incurs virtually no overhead. On a set of sparse matrices from the University of Florida Sparse Matrix Collection we show an average speed-up of 2.1 times over NVIDIA's CUSPARSE 4.0 library in single precision and 1.4 times in double precision. Many algorithms require repeated evaluation of sparse matrix-vector products with the same matrix, so we introduce a dynamic run-time auto-tuning system which improves performance by 10-15% in seven iterations. The CSR format is compared to alternative ELLPACK and HYB formats and the cost of conversion is assessed using CUSPARSE. Sparse matrix-vector multiplication performance is also analysed when solving a finite element problem with the conjugate gradient method. We show how problem-specific knowledge can be used to improve performance by up to a factor of two.
Keywords :
cache storage; conjugate gradient methods; finite element analysis; graphics processing units; mathematics computing; matrix multiplication; sparse matrices; vectors; CSR format; CUSPARSE; ELLPACK formats; HYB formats; L1 caches; NVIDIA CUSPARSE 4.0 library; NVIDIA Fermi architecture; University of Florida sparse matrix collection; bandwidth-limited operation; cache-based GPU; cache-based architectures; compressed sparse row format; conjugate gradient method; dynamic run-time autotuning system; finite element problem; general purpose code; sparse matrix-vector multiplication; spatial locality; temporal locality; Algorithm design and analysis; Graphics processing unit; Heuristic algorithms; Instruction sets; Memory management; Sparse matrices; Vectors; autotuning; cache performance; conjugate gradient method; finite element method; sparse matrix-vector multiplication;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Innovative Parallel Computing (InPar), 2012
Conference_Location :
San Jose, CA
Print_ISBN :
978-1-4673-2632-2
Electronic_ISBN :
978-1-4673-2631-5
Type :
conf
DOI :
10.1109/InPar.2012.6339602
Filename :
6339602