Title :
Application Composition and Communication Optimization in Iterative Solvers Using FPGAs
Author :
Rafique, Aasim ; Kapre, Nachiket ; Constantinides, George A.
Author_Institution :
Dept. of Electr. & Electron. Eng., Imperial Coll. London, London, UK
Abstract :
We consider the problem of minimizing communication with off-chip memory and composing multiple linear algebra kernels in iterative solvers for large-scale eigenvalue problems and linear systems of equations. While GPUs may offer higher throughput for individual kernels, overall application performance is limited by their inability to support on-chip sharing of data across kernels. In this paper, we show that higher on-chip memory capacity and superior on-chip communication bandwidth enable FPGAs to better support the composition of a sequence of kernels within these iterative solvers. We present a time-multiplexed FPGA architecture that exploits the on-chip capacity to store dependencies between kernels and the high communication bandwidth to move data. We propose a resource-constrained framework to select the optimal value of an algorithmic parameter that controls the tradeoff between communication and computation cost for a particular FPGA. Using the Lanczos method as a case study, we show how this tight algorithm-architecture interaction minimizes communication on FPGAs and delivers superior performance over a GPU despite the GPU's ~5x larger off-chip memory bandwidth and ~2x greater peak single-precision floating-point performance.
Keywords :
eigenvalues and eigenfunctions; field programmable gate arrays; floating point arithmetic; graphics processing units; iterative methods; mathematics computing; microprocessor chips; GPU; Lanczos method; algorithm-architecture interaction; application composition; communication cost; communication minimization problem; communication optimization; computation cost; dependency storage; iterative solvers; large-scale eigenvalue problems; linear systems-of-equations; multiple linear algebra kernels; off-chip memory bandwidth; on-chip communication bandwidth; on-chip data sharing; on-chip memory capacity; peak single-precision floating-point performance; resource-constrained framework; time-multiplexed FPGA architecture; Bandwidth; Computer architecture; Field programmable gate arrays; Graphics processing units; Kernel; System-on-chip; Vectors; Communication-Avoiding Iterative Solvers; FPGAs; GPUs; Matrix Powers; SpMV;
Conference_Title :
Field-Programmable Custom Computing Machines (FCCM), 2013 IEEE 21st Annual International Symposium on
Conference_Location :
Seattle, WA
Print_ISBN :
978-1-4673-6005-0
DOI :
10.1109/FCCM.2013.16