DocumentCode :
3591099
Title :
Matrix-matrix multiplication on a large register file architecture with indirection
Author :
Sreedhar, Dheeraj ; Derby, J.H. ; Montoye, R.K. ; Johnson, C.L.
Author_Institution :
IBM Res., Bangalore, India
fYear :
2014
Firstpage :
1
Lastpage :
10
Abstract :
Dense matrix-matrix multiply is an important kernel in many high-performance computing applications, including emerging cognitive computing applications based on deep neural networks. Graphics processing units (GPUs) have been very successful in handling dense matrix-matrix multiply in a variety of applications. However, recent research has shown that GPUs use the compute resources available on the silicon very inefficiently for matrix multiply, as measured by utilization of peak floating-point operations per second (FLOPS). In this paper, we show that an architecture with a large register file supported by “indirection” can utilize the floating-point computing resources on the processor much more efficiently. A key feature of our proposed in-line accelerator is a bank-based very-large register file with embedded SIMD support. This processor-in-regfile (PIR) strategy is implemented as local computation elements (LCEs) attached to each bank, overcoming the limited number of register file ports. Because each LCE is a SIMD computation element, and all of them can proceed concurrently, the PIR approach constitutes a highly parallel super-wide-SIMD device. We show that we can achieve more than 25% better performance than the best known results for matrix multiply using GPUs. This is achieved using far fewer floating-point computing units and hence less silicon area and power. We also show that the architecture blends well with the Strassen and Winograd matrix multiply algorithms. We optimize the selective data parallelism that the LCEs enable for these algorithms and study the area-performance trade-offs.
Keywords :
floating point arithmetic; graphics processing units; mathematics computing; matrix multiplication; parallel processing; FLOPS; GPU; LCEs; PIR strategy; SIMD computation element; Strassen matrix multiply algorithm; Winograd matrix multiply algorithm; area-performance trade-offs; bank-based very-large register file; cognitive computing applications; deep neural network; embedded SIMD support; floating point computing resources; floating point operations per second; graphical processing units; high performance computing applications; highly-parallel super-wide-SIMD device; in-line accelerator; local computation elements; matrix-matrix multiplication; processor-in-regfile strategy; register file architecture; selective data parallelism; Graphics processing units; Matrix decomposition; Multicore processing; Parallel processing; Ports (Computers); Registers; Dense Matrix multiply; GPU; SIMD; Vector processor;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing (HiPC), 2014 21st International Conference on
Print_ISBN :
978-1-4799-5975-4
Type :
conf
DOI :
10.1109/HiPC.2014.7116709
Filename :
7116709