• DocumentCode
    1955024
  • Title
    Efficient sparse matrix-vector multiplication on cache-based GPUs
  • Author
    Reguly, Istvan ; Giles, Mike
  • Author_Institution
    Fac. of Inf. Technol., Pazmany Peter Catholic Univ., Budapest, Hungary
  • fYear
    2012
  • fDate
    13-14 May 2012
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    Sparse matrix-vector multiplication is an integral part of many scientific algorithms. Several studies have shown that it is a bandwidth-limited operation on current hardware. On cache-based architectures the main factors that influence performance are spatial locality in accessing the matrix, and temporal locality in re-using the elements of the vector. This paper discusses efficient implementations of sparse matrix-vector multiplication on NVIDIA's Fermi architecture, the first to introduce conventional L1 caches to GPUs. We focus on the compressed sparse row (CSR) format for developing general purpose code. We present a parametrised algorithm, show the effects of parameter tuning on performance and introduce a method for determining the near-optimal set of parameters that incurs virtually no overhead. On a set of sparse matrices from the University of Florida Sparse Matrix Collection we show an average speed-up of 2.1 times over NVIDIA's CUSPARSE 4.0 library in single precision and 1.4 times in double precision. Many algorithms require repeated evaluation of sparse matrix-vector products with the same matrix, so we introduce a dynamic run-time auto-tuning system which improves performance by 10-15% in seven iterations. The CSR format is compared to alternative ELLPACK and HYB formats and the cost of conversion is assessed using CUSPARSE. Sparse matrix-vector multiplication performance is also analysed when solving a finite element problem with the conjugate gradient method. We show how problem-specific knowledge can be used to improve performance by up to a factor of two.
  • Keywords
    cache storage; conjugate gradient methods; finite element analysis; graphics processing units; mathematics computing; matrix multiplication; sparse matrices; vectors; CSR format; CUSPARSE; ELLPACK formats; HYB formats; L1 caches; NVIDIA CUSPARSE 4.0 library; NVIDIA Fermi architecture; University of Florida sparse matrix collection; bandwidth-limited operation; cache-based GPU; cache-based architectures; compressed sparse row format; conjugate gradient method; dynamic run-time autotuning system; finite element problem; general purpose code; sparse matrix-vector multiplication; spatial locality; temporal locality; Algorithm design and analysis; Graphics processing unit; Heuristic algorithms; Instruction sets; Memory management; Sparse matrices; Vectors; autotuning; cache performance; conjugate gradient method; finite element method; sparse matrix-vector multiplication;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Innovative Parallel Computing (InPar), 2012
  • Conference_Location
    San Jose, CA
  • Print_ISBN
    978-1-4673-2632-2
  • Electronic_ISBN
    978-1-4673-2631-5
  • Type
    conf
  • DOI
    10.1109/InPar.2012.6339602
  • Filename
    6339602