Integrated CUDA-to-FPGA Synthesis with Network-on-Chip

Author

Gurumani, Swathi T.; Tolar, Jacob; Chen, Yao; Liang, Yun; Rupnow, Kyle; Chen, Deming

Author_Institution

Adv. Digital Sci. Center, Singapore, Singapore

fYear

2014

fDate

11-13 May 2014

Firstpage

21

Lastpage

24

Abstract

Data-parallel languages such as CUDA and OpenCL efficiently describe many parallel threads of computation, and HLS tools can effectively translate these descriptions into independent, optimized cores. As the number of instantiated cores grows, average external memory access latency can become a significant factor in system performance. However, although each core produces its outputs independently, the cores often share a large fraction of their input data. Exploiting on-chip data sharing both reduces external bandwidth demand and improves average memory access latency, allowing the system to improve performance with the same number of cores. In this paper, we develop a network-on-chip, coupled with computation cores synthesized from CUDA for FPGAs, that enables on-chip data sharing. We demonstrate reductions in external bandwidth demand of up to 60% (average 56%) and in total application latency, measured in cycles, of up to 43% (average 27%).

Keywords

field programmable gate arrays; high level synthesis; multi-threading; network-on-chip; parallel architectures; parallel languages; HLS tools; OpenCL; data parallel languages; external bandwidth demand; external memory access latency; integrated CUDA-to-FPGA synthesis; network-on-chip; on-chip data sharing; parallel threads; system performance; Bandwidth; Field programmable gate arrays; Graphics processing units; Kernel; Network-on-chip; Ports (Computers); CUDA; Directory protocol; High-level Synthesis; Memory bandwidth; NoC

fLanguage

English

Publisher

IEEE

Conference_Titel

2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

Conference_Location

Boston, MA

Print_ISBN

978-1-4799-5110-9

Type

conf

DOI

10.1109/FCCM.2014.14

Filename

6861576