Title :
Reducing Vector I/O for Faster GPU Sparse Matrix-Vector Multiplication
Author :
Pham Nguyen Quang Anh ; Rui Fan ; Yonggang Wen
Author_Institution :
Sch. of Comput. Eng., Nanyang Technol. Univ., Singapore, Singapore
Abstract :
Sparse matrix-vector multiplication (SpMV) is an important kernel used in solving many scientific and engineering problems. The massive parallelism of graphics processing units (GPUs) makes them well suited for SpMV computations. However, fully utilizing the power of GPUs is challenging because SpMV makes a large number of scattered memory accesses which saturate the GPU's memory bandwidth. Most previous works sought to address the bandwidth limitation by using efficient storage formats for the matrix. However, we show that for most matrices, a majority of the bandwidth is consumed by accesses to the vector. In this paper, we introduce two techniques to significantly decrease the I/O for vector accesses, by making novel use of the GPU's fast shared memory. A key advantage of our vector optimizations is that they are complementary to existing matrix I/O optimizations, so that it is possible to use both techniques in conjunction. Furthermore, combining the optimizations requires only minor code changes. We demonstrate how to combine our techniques with the widely used CUSP SpMV algorithm and the currently highest performing yaSpMV algorithm to significantly improve both algorithms' performance. We experimented with a wide range of matrices, and show that the modified version of CUSP on average reduces vector I/O by 37% and reduces the total I/O by 31%, while the modified version of yaSpMV reduces the vector and total I/O by 36% and 31%, respectively. We improve CUSP's total throughput by 14% on average and up to 77% for certain matrices, and improve yaSpMV's throughput by 12% on average and 35% for some matrices.
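The abstract does not describe the two techniques in detail, so the following is only an illustrative sketch of the general idea of reusing vector entries from a fast local buffer, not the authors' method. It assumes a matrix in CSR form and a hypothetical grouping of rows into blocks (`rows_per_block`), with a Python set standing in for GPU shared memory. All function names and the 4x4 example matrix are assumptions made for illustration.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Each nonzero triggers one (possibly scattered) read of x.
            y[row] += vals[k] * x[col_idx[k]]
    return y


def vector_reads(row_ptr, col_idx, rows_per_block):
    """Count reads of x without and with per-block staging.

    'naive' is one read of x per nonzero. 'staged' assumes each block of
    rows loads every distinct x entry it needs once into a fast local
    buffer (a stand-in for GPU shared memory) and reuses it from there.
    """
    n_rows = len(row_ptr) - 1
    naive = row_ptr[-1]
    staged = 0
    for start in range(0, n_rows, rows_per_block):
        end = min(start + rows_per_block, n_rows)
        staged += len(set(col_idx[row_ptr[start]:row_ptr[end]]))
    return naive, staged


# Hypothetical 4x4 example matrix (not taken from the paper):
# [[1, 0, 2, 0],
#  [0, 3, 4, 0],
#  [5, 0, 6, 0],
#  [0, 0, 7, 8]]
row_ptr = [0, 2, 4, 6, 8]
col_idx = [0, 2, 1, 2, 0, 2, 2, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x = [1.0, 1.0, 1.0, 1.0]

y = spmv_csr(row_ptr, col_idx, vals, x)
naive, staged = vector_reads(row_ptr, col_idx, rows_per_block=2)
```

In this toy case the count of vector reads drops from one per nonzero to one per distinct column index in each block of rows. The numbers say nothing about the savings reported in the abstract, which depend on real matrices and on GPU memory transaction sizes.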
Keywords :
graphics processing units; matrix multiplication; parallel processing; sparse matrices; storage management; vector processor systems; CUSP SpMV algorithm; GPU fast shared memory; SpMV computations; bandwidth limitation; graphics processing units; matrix I/O optimizations; memory bandwidth; parallelism; scattered memory accesses; sparse matrix-vector multiplication; storage formats; vector I/O; vector optimizations; yaSpMV algorithm; Bandwidth; Graphics processing units; Indexes; Instruction sets; Kernel; Sparse matrices; Throughput;
Conference_Title :
Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International
Conference_Location :
Hyderabad
DOI :
10.1109/IPDPS.2015.100