مرکز منطقه ای اطلاع رساني علوم و فناوري - Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs

DocumentCode :

71761

Title :

Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs

Author :

Rafique, Aasim ; Constantinides, George A. ; Kapre, Nachiket

Author_Institution :

Dept. of Electr. & Electron. Eng., Imperial Coll. London, London, UK

Volume :

Issue :

fYear :

2015

fDate :

Jan. 2015

Firstpage :

Lastpage :

Abstract :

Trading communication with redundant computation can increase the silicon efficiency of FPGAs and GPUs in accelerating communication-bound sparse iterative solvers. While k iterations of the iterative solver can be unrolled to provide O(k) reduction in communication cost, the extent of this unrolling depends on the underlying architecture, its memory model, and the growth in redundant computation. This paper presents a systematic procedure to select this algorithmic parameter k, which provides communication-computation tradeoff on hardware accelerators like FPGA and GPU. We provide predictive models to understand this tradeoff and show how careful selection of k can lead to performance improvement that otherwise demands significant increase in memory bandwidth. On an Nvidia C2050 GPU, we demonstrate a 1.9×-42.6× speedup over standard iterative solvers for a range of benchmarks and that this speedup is limited by the growth in redundant computation. In contrast, for FPGAs, we present an architecture-aware algorithm that limits off-chip communication but allows communication between the processing cores. This reduces redundant computation and allows large k and hence higher speedups. Our approach for FPGA provides a 0.3×-4.4× speedup over same-generation GPU devices where k is picked carefully for both architectures for a range of benchmarks.

Keywords :

field programmable gate arrays; graphics processing units; iterative methods; mathematics computing; matrix multiplication; FPGA; Nvidia C2050 GPU; algorithmic parameter; architecture-aware algorithm; communication cost reduction; communication optimization; communication-bound sparse iterative solvers; communication-computation tradeoff; field programmable gate array; graphics processing unit; hardware accelerators; iterative sparse matrix-vector multiplication; off-chip communication; processing core communication; Field programmable gate arrays; Graphics processing units; Instruction sets; Kernel; Registers; Sparse matrices; Vectors; Iterative numerical methods; field programmable gate arrays (FPGAs); graphics processing units (GPUs); matrix powers kernel; spare matrix-vector multiply;

fLanguage :

English

Journal_Title :

Parallel and Distributed Systems, IEEE Transactions on

Publisher :

ieee

ISSN :

1045-9219

Type :

jour

DOI :

10.1109/TPDS.2014.6

Filename :

6719387

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=71761