Application Composition and Communication Optimization in Iterative Solvers Using FPGAs

Author

Rafique, Aasim ; Kapre, Nachiket ; Constantinides, George A.

Author_Institution

Dept. of Electr. & Electron. Eng., Imperial Coll. London, London, UK

fYear

2013

fDate

28-30 April 2013

Firstpage

153

Lastpage

160

Abstract

We consider the problem of minimizing communication with off-chip memory and composing multiple linear algebra kernels in iterative solvers for large-scale eigenvalue problems and linear systems of equations. While GPUs may offer higher throughput for individual kernels, overall application performance is limited by their inability to support on-chip sharing of data across kernels. In this paper, we show that higher on-chip memory capacity and superior on-chip communication bandwidth enable FPGAs to better support the composition of a sequence of kernels within these iterative solvers. We present a time-multiplexed FPGA architecture which exploits the on-chip capacity to store dependencies between kernels and the high communication bandwidth to move data. We propose a resource-constrained framework to select the optimal value of an algorithmic parameter which controls the tradeoff between communication and computation cost for a particular FPGA. Using the Lanczos method as a case study, we show how to minimize communication on FPGAs through this tight algorithm-architecture interaction and achieve superior performance over a GPU despite the GPU's ~5x larger off-chip memory bandwidth and ~2x greater peak single-precision floating-point performance.
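
For readers unfamiliar with the case study, the kernel sequence the abstract refers to can be illustrated with the textbook Lanczos recurrence. The following is a minimal sketch in plain Python, assuming a small dense symmetric matrix stored as a list of rows; the function names (`matvec`, `dot`, `lanczos`) are illustrative and not taken from the paper, and the sketch does not model the paper's FPGA architecture or its communication-avoiding (matrix powers) formulation. It only shows that each iteration chains a matrix-vector product, dot products, and vector updates, with the vectors `q`, `q_prev`, and `w` acting as the dependencies passed between kernels.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product; stands in for the SpMV kernel."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def lanczos(A, q0, steps):
    """Run up to `steps` Lanczos iterations on a symmetric matrix A.

    Returns (alphas, betas): the diagonal and off-diagonal entries of the
    tridiagonal matrix whose eigenvalues approximate those of A.
    """
    norm = math.sqrt(dot(q0, q0))
    q = [v / norm for v in q0]          # normalized starting vector
    q_prev = [0.0] * len(q)
    beta = 0.0
    alphas, betas = [], []
    for _ in range(steps):
        w = matvec(A, q)                # kernel: matrix-vector product
        alpha = dot(w, q)               # kernel: dot product
        # kernel: vector update (three-term recurrence)
        w = [wi - alpha * qi - beta * pi for wi, qi, pi in zip(w, q, q_prev)]
        beta = math.sqrt(dot(w, w))     # kernel: vector norm
        alphas.append(alpha)
        if beta < 1e-12:                # invariant subspace found; stop early
            break
        betas.append(beta)
        q_prev, q = q, [wi / beta for wi in w]
    # A k-by-k tridiagonal matrix has k - 1 off-diagonal entries.
    return alphas, betas[: len(alphas) - 1]


if __name__ == "__main__":
    A = [[2.0, 1.0], [1.0, 2.0]]
    print(lanczos(A, [1.0, 0.0], 2))
```

In this form every intermediate vector is produced by one kernel and consumed by the next, which is why keeping those vectors on chip, rather than writing them back to off-chip memory between kernels, reduces communication.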

Keywords

eigenvalues and eigenfunctions; field programmable gate arrays; floating point arithmetic; graphics processing units; iterative methods; mathematics computing; microprocessor chips; GPU; Lanczos method; algorithm-architecture interaction; application composition; communication cost; communication minimization problem; communication optimization; computation cost; dependency storage; iterative solvers; large-scale eigenvalue problems; linear systems-of-equations; multiple linear algebra kernels; off-chip memory bandwidth; on-chip communication bandwidth; on-chip data sharing; on-chip memory capacity; peak single-precision floating-point performance; resource-constrained framework; time-multiplexed FPGA architecture; Bandwidth; Computer architecture; Field programmable gate arrays; Graphics processing units; Kernel; System-on-chip; Vectors; Communication-Avoiding Iterative Solvers; FPGAs; GPUs; Matrix Powers; SpMV;

fLanguage

English

Publisher

IEEE

Conference_Titel

Field-Programmable Custom Computing Machines (FCCM), 2013 IEEE 21st Annual International Symposium on

Conference_Location

Seattle, WA

Print_ISBN

978-1-4673-6005-0

Type

conf

DOI

10.1109/FCCM.2013.16

Filename

6546011