DocumentCode
1945628
Title
Enhancing performance of Tall-Skinny QR factorization using FPGAs
Author
Rafique, Abid ; Kapre, Nachiket ; Constantinides, George A.
Author_Institution
Electr. & Electron. Eng. Dept., Imperial Coll. London, London, UK
fYear
2012
fDate
29-31 Aug. 2012
Firstpage
443
Lastpage
450
Abstract
Communication-avoiding linear algebra algorithms that require low communication latency and high memory bandwidth, such as Tall-Skinny QR factorization (TSQR), are well suited to acceleration using FPGAs. TSQR parallelizes the QR factorization of tall-skinny matrices in a divide-and-conquer fashion: it decomposes the matrix into sub-matrices, performs local QR factorizations, and then merges the intermediate results. Because TSQR is a dense linear algebra problem, one might expect a GPU to perform better. However, GPU performance is limited by memory bandwidth in the local QR factorizations and by global communication latency in the merge stage. We exploit the shape of the matrix and propose an FPGA-based custom architecture that avoids these bottlenecks by using high-bandwidth on-chip memories for the local QR factorizations and by performing the merge stage entirely on-chip to reduce communication latency. We achieve a peak double-precision floating-point performance of 129 GFLOPs on a Virtex-6 SX475T. A quantitative comparison of our proposed design with recent QR factorizations on FPGAs and GPUs shows speedups of up to 7.7× and 12.7×, respectively. Additionally, we show even larger speedups over optimized linear algebra libraries such as Intel MKL for multi-cores, CULA for GPUs, and MAGMA for hybrid systems.
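The divide-and-conquer structure described in the abstract (split into row blocks, factor each block locally, merge the intermediate R factors) can be illustrated with a minimal NumPy sketch. This is not the authors' FPGA design. The function name `tsqr_r`, the `num_blocks` parameter, and the pairwise (binary-tree) merge order are assumptions made for illustration, and only the R factor is computed.

```python
import numpy as np


def tsqr_r(a, num_blocks=4):
    """Return the R factor of a tall-skinny matrix via a TSQR-style reduction.

    Illustrative sketch only: the name, the block count, and the pairwise
    merge order are assumptions, not details taken from the paper.
    """
    m, n = a.shape
    # Divide: split the tall matrix into row blocks.
    blocks = np.array_split(a, num_blocks, axis=0)
    if any(block.shape[0] < n for block in blocks):
        raise ValueError("each row block needs at least n rows")

    # Local stage: independent QR of each block, keeping only the n x n R.
    rs = [np.linalg.qr(block, mode="r") for block in blocks]

    # Merge stage: stack pairs of R factors and re-factor until one remains.
    while len(rs) > 1:
        merged = []
        for i in range(0, len(rs), 2):
            pair = rs[i:i + 2]
            if len(pair) == 2:
                merged.append(np.linalg.qr(np.vstack(pair), mode="r"))
            else:
                # Odd block out: carry it to the next level unchanged.
                merged.append(pair[0])
        rs = merged
    return rs[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((1000, 8))
    r = tsqr_r(a, num_blocks=4)
    # R is unique only up to row signs, so compare the Gram matrices.
    print(np.allclose(r.T @ r, a.T @ a))
```

Each merge step factors only a 2n × n stack, so the merge-stage working set is small when n is small. That is consistent with the abstract's point that the merge can be kept in on-chip memory.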
Keywords
divide and conquer methods; field programmable gate arrays; linear algebra; matrix decomposition; CULA; FPGA-based custom architecture; GFLOP; GPU; Intel MKL; MAGMA; TSQR; Tall-skinny QR factorization; Virtex-6 SX475T; communication-avoiding linear algebra algorithms; divide-and-conquer fashion; high memory bandwidth; low communication latency; multicores; optimized linear algebra libraries; peak double-precision floating-point performance; tall-skinny matrices; Computer architecture; Field programmable gate arrays; Graphics processing unit; Parallel processing; System-on-a-chip; Tiles; Vectors;
fLanguage
English
Publisher
IEEE
Conference_Titel
Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on
Conference_Location
Oslo
Print_ISBN
978-1-4673-2257-7
Electronic_ISBN
978-1-4673-2255-3
Type
conf
DOI
10.1109/FPL.2012.6339142
Filename
6339142