Title :
Optimization and evaluation of image- and signal-processing kernels on the TI C6678 multi-core DSP
Author :
Ramesh, Barath ; Bhardwaj, Asheesh ; Richardson, Justin ; George, Alan D. ; Lam, Herman
Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of Florida, Gainesville, FL, USA
Abstract :
Power efficiency is an important aspect of today's high-performance embedded computing (HPEC) systems. Digital signal processors (DSPs) are well known for their power efficiency and are commonly employed in embedded systems. Increasing computational demands in image- and signal-processing applications in embedded systems have led to the development of multi-core DSPs with floating-point capabilities. The TMS320C6678 is an eight-core, high-performance DSP from Texas Instruments that provides 128 GFLOPS of single-precision and 32 GFLOPS of double-precision performance under 10 W of power. In this paper, we optimize and evaluate the performance of the TMS320C6678 DSP using two image-processing kernels, 2D convolution and bilinear interpolation with image rotation, and two signal-processing kernels, frequency-domain finite impulse response (FDFIR) and corner turn. Our 2D convolution results show that the performance of the TMS320C6678 is comparable to an Nvidia GeForce 295 GTX GPU and 5 times better than a quad-core Intel Xeon W3520 CPU. We achieve real-time performance for bilinear interpolation with image rotation on the TMS320C6678 for high-definition (HD) image resolution. Our performance-per-Watt results for FDFIR show that the TMS320C6678 is 8.2 times better than the Nvidia Tesla C2050 GPU. For corner turn, although the raw performance of the Tesla C2050 is better than that of the TMS320C6678, the performance per Watt of the TMS320C6678 is 1.8 times better than that of the Tesla C2050.
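For readers unfamiliar with two of the kernels named in the abstract, the following is a minimal reference sketch in plain Python, not the optimized DSP implementation evaluated in the paper. It assumes that corner turn means a transpose of a row-major matrix and that image rotation is done by inverse mapping with bilinear sampling. The function names (`corner_turn`, `bilinear_sample`, `rotate_bilinear`) and the choice to return 0.0 for out-of-range samples are illustrative assumptions.

```python
import math

def corner_turn(buf, rows, cols):
    """Transpose a row-major rows x cols matrix stored in a flat list."""
    return [buf[r * cols + c] for c in range(cols) for r in range(rows)]

def bilinear_sample(img, x, y):
    """Sample img (a list of rows) at fractional (x, y).

    Assumption for this sketch: coordinates outside the image yield 0.0.
    """
    h, w = len(img), len(img[0])
    if x < 0 or y < 0 or x > w - 1 or y > h - 1:
        return 0.0
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Blend horizontally along the two neighbouring rows, then vertically.
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bottom = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1 - fy) * top + fy * bottom

def rotate_bilinear(img, angle_rad):
    """Rotate img about its centre by mapping each output pixel back to a
    source coordinate and sampling the source bilinearly."""
    h, w = len(img), len(img[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    out = [[0.0] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            dx, dy = u - cx, v - cy
            x = c * dx + s * dy + cx
            y = -s * dx + c * dy + cy
            out[v][u] = bilinear_sample(img, x, y)
    return out
```

For example, `corner_turn([1, 2, 3, 4, 5, 6], 2, 3)` reorders a 2 x 3 matrix into its 3 x 2 transpose, and `rotate_bilinear(img, 0.0)` leaves the image unchanged. An optimized version would typically restructure these loops for the target's memory hierarchy and SIMD units, which this sketch does not attempt.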
Keywords :
convolution; digital signal processing chips; embedded systems; floating point arithmetic; image resolution; interpolation; multiprocessing systems; 2D convolution; FDFIR; HPEC system; Nvidia GeForce 295 GTX GPU; Nvidia Tesla C2050 GPU; TI C6678 multicore DSP; TMS320C6678; Texas Instruments; bilinear interpolation; computational demand; computer speed 128 GFLOPS; computer speed 32 GFLOPS; corner turn; digital signal processors; double-precision performance; eight-core high-performance DSP; floating-point capabilities; frequency-domain finite impulse response; high-definition image resolution; high-performance embedded computing system; image rotation; image-processing applications; image-processing kernel optimization; power 10 W; power efficiency; quadcore Intel Xeon W3520 CPU; real-time performance; signal-processing applications; signal-processing kernel evaluation; Computer architecture; Convolution; Digital signal processing; Interpolation; Kernel; Optimization; Random access memory;
Conference_Title :
High Performance Extreme Computing Conference (HPEC), 2014 IEEE
Conference_Location :
Waltham, MA
Print_ISBN :
978-1-4799-6232-7
DOI :
10.1109/HPEC.2014.7040989