Regional Information Center for Science and Technology - Matrix-matrix multiplication on a large register file architecture with indirection

DocumentCode :

3591099

Title :

Matrix-matrix multiplication on a large register file architecture with indirection

Author :

Sreedhar, Dheeraj ; Derby, J.H. ; Montoye, R.K. ; Johnson, C.L.

Author_Institution :

IBM Res., Bangalore, India

fYear :

2014

Firstpage :

Lastpage :

Abstract :

Dense matrix-matrix multiply is an important kernel in many high performance computing applications, including the emerging deep neural network based cognitive computing applications. Graphical processing units (GPUs) have been very successful in handling dense matrix-matrix multiply in a variety of applications. However, recent research has shown that GPUs are very inefficient in using the available compute resources on the silicon for matrix multiply in terms of utilization of peak floating point operations per second (FLOPS). In this paper, we show that an architecture with a large register file supported by “indirection” can utilize the floating point computing resources on the processor much more efficiently. A key feature of our proposed in-line accelerator is a bank-based very-large register file, with embedded SIMD support. This processor-in-regfile (PIR) strategy is implemented as local computation elements (LCEs) attached to each bank, overcoming the limited number of register file ports. Because each LCE is a SIMD computation element, and all of them can proceed concurrently, the PIR approach constitutes a highly-parallel super-wide-SIMD device. We show that we can achieve more than 25% better performance than the best known results for matrix multiply using GPUs. This is achieved using far fewer floating point computing units and hence less silicon area and power. We also show that the architecture blends well with the Strassen and Winograd matrix multiply algorithms. We optimize the selective data parallelism that the LCEs enable for these algorithms and study the area-performance trade-offs.
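
The abstract mentions the Strassen matrix multiply algorithm without describing it. As a rough illustration only, the sketch below shows one level of the textbook Strassen recursion, which forms a 2x2 block product with seven block multiplications instead of eight. It is not the paper's implementation and says nothing about how the work is mapped onto LCEs or register file banks. All function names (`mat_add`, `mat_sub`, `mat_mul`, `split`, `strassen_one_level`) are hypothetical, and the code assumes square matrices with an even dimension, stored as lists of rows.

```python
def mat_add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mat_sub(a, b):
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mat_mul(a, b):
    """Schoolbook matrix product, used here for the block multiplications."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def split(m):
    """Split an even-sized square matrix into four equal quadrants."""
    h = len(m) // 2
    return (
        [row[:h] for row in m[:h]],  # top-left
        [row[h:] for row in m[:h]],  # top-right
        [row[:h] for row in m[h:]],  # bottom-left
        [row[h:] for row in m[h:]],  # bottom-right
    )


def strassen_one_level(a, b):
    """One level of Strassen: seven block products instead of eight."""
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)

    # The seven Strassen block products.
    m1 = mat_mul(mat_add(a11, a22), mat_add(b11, b22))
    m2 = mat_mul(mat_add(a21, a22), b11)
    m3 = mat_mul(a11, mat_sub(b12, b22))
    m4 = mat_mul(a22, mat_sub(b21, b11))
    m5 = mat_mul(mat_add(a11, a12), b22)
    m6 = mat_mul(mat_sub(a21, a11), mat_add(b11, b12))
    m7 = mat_mul(mat_sub(a12, a22), mat_add(b21, b22))

    # Recombine the products into the four quadrants of the result.
    c11 = mat_add(mat_sub(mat_add(m1, m4), m5), m7)
    c12 = mat_add(m3, m5)
    c21 = mat_add(m2, m4)
    c22 = mat_add(mat_add(mat_sub(m1, m2), m3), m6)

    # Stitch the quadrants back together row by row.
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

Calling `strassen_one_level(a, b)` on two even-sized square matrices should give the same result as `mat_mul(a, b)`. The saving is one block multiplication per level, paid for with extra block additions and subtractions.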

Keywords :

floating point arithmetic; graphics processing units; mathematics computing; matrix multiplication; parallel processing; FLOPS; GPU; LCEs; PIR strategy; SIMD computation element; Strassen matrix multiply algorithm; Winograd matrix multiply algorithm; area-performance trade-offs; bank-based very-large register file; cognitive computing applications; deep neural network; embedded SIMD support; floating point computing resources; floating point operations per second; graphical processing units; high performance computing applications; highly-parallel super-wide-SIMD device; in-line accelerator; local computation elements; matrix-matrix multiplication; processor-in-regfile strategy; register file architecture; selective data parallelism; Graphics processing units; Matrix decomposition; Multicore processing; Parallel processing; Ports (Computers); Registers; Dense Matrix multiply; GPU; SIMD; Vector processor;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

High Performance Computing (HiPC), 2014 21st International Conference on

Print_ISBN :

978-1-4799-5975-4

Type :

conf

DOI :

10.1109/HiPC.2014.7116709

Filename :

7116709

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3591099