Title :
Integrated CUDA-to-FPGA Synthesis with Network-on-Chip
Author :
Gurumani, Swathi T. ; Tolar, Jacob ; Chen, Yao ; Liang, Yun ; Rupnow, Kyle ; Chen, Deming
Author_Institution :
Advanced Digital Sciences Center, Singapore, Singapore
Abstract :
Data-parallel languages such as CUDA and OpenCL efficiently describe many parallel threads of computation, and high-level synthesis (HLS) tools can effectively translate these descriptions into independent, optimized cores. As the number of instantiated cores grows, average external memory access latency can become a significant factor in system performance. However, although each core produces its outputs independently, the cores often share input data heavily. Exploiting on-chip data sharing both reduces external bandwidth demand and improves average memory access latency, allowing the system to improve performance with the same number of cores. In this paper, we develop a network-on-chip, coupled with computation cores synthesized from CUDA for FPGAs, that enables on-chip data sharing. We demonstrate reductions in external bandwidth demand of up to 60% (average 56%) and in total application latency, measured in cycles, of up to 43% (average 27%).
Keywords :
field programmable gate arrays; high level synthesis; multi-threading; network-on-chip; parallel architectures; parallel languages; HLS tools; OpenCL; data parallel languages; external bandwidth demand; external memory access latency; integrated CUDA-to-FPGA synthesis; network-on-chip; on-chip data sharing; parallel threads; system performance; Bandwidth; Field programmable gate arrays; Graphics processing units; Kernel; Network-on-chip; Ports (Computers); CUDA; Directory protocol; High-level Synthesis; Memory bandwidth; NoC
Conference_Title :
2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
Conference_Location :
Boston, MA
Print_ISBN :
978-1-4799-5110-9
DOI :
10.1109/FCCM.2014.14