  • DocumentCode
    1783358
  • Title
    Anatomy of High-Performance Many-Threaded Matrix Multiplication
  • Author
    Smith, Tyler M. ; van de Geijn, Robert ; Smelyanskiy, Mikhail ; Hammond, Jeff R. ; Van Zee, Field G.
  • Author_Institution
    Dept. of Comput. Sci., Univ. of Texas at Austin, Austin, TX, USA
  • fYear
    2014
  • fDate
    19-23 May 2014
  • Firstpage
    1049
  • Lastpage
    1059
  • Abstract
    BLIS is a new framework for rapid instantiation of the BLAS. We describe how BLIS extends the "GotoBLAS approach" to implementing matrix multiplication (GEMM). While GEMM was previously implemented as three loops around an inner kernel, BLIS exposes two additional loops within that inner kernel, casting the computation in terms of the BLIS micro-kernel so that porting GEMM becomes a matter of customizing this micro-kernel for a given architecture. We discuss how this facilitates a finer level of parallelism that greatly simplifies the multithreading of GEMM and creates additional opportunities for parallelizing multiple loops. Specifically, we show that with the advent of many-core architectures such as the IBM PowerPC A2 processor (used by Blue Gene/Q) and the Intel Xeon Phi processor, parallelizing both within and around the inner kernel, as the BLIS approach supports, is not only convenient, but also necessary for scalability. The resulting implementations deliver what we believe to be the best open-source performance for these architectures, achieving both impressive performance and excellent scalability.
  • Keywords
    matrix multiplication; multi-threading; multiprocessing systems; parallel architectures; parallel processing; BLAS-like library instantiation software; BLIS microkernel; Blue Gene/Q system; GEMM porting; GotoBLAS approach; IBM PowerPC A2 processor; Intel Xeon Phi processor; high-performance many-threaded matrix multiplication anatomy; many-core architectures; multiple loop parallelism; multithreading; open source performance; Computer architecture; Instruction sets; Integrated circuits; Kernel; Libraries; Parallel processing; Scalability; BLAS; high-performance; libraries; linear algebra; matrix; multicore
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium, 2014 IEEE 28th International
  • Conference_Location
    Phoenix, AZ
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4799-3799-8
  • Type
    conf
  • DOI
    10.1109/IPDPS.2014.110
  • Filename
    6877334
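
The loop structure summarized in the abstract (three loops around an inner kernel, plus two further loops inside it that call a micro-kernel) can be illustrated with a minimal sketch. This is not the BLIS implementation: the function names `gemm_blocked` and `micro_kernel`, the loop-index names, and the tiny default block sizes are assumptions chosen for readability, and the packing of A and B into contiguous buffers that a real implementation would perform is omitted.

```python
def micro_kernel(A, B, C, i0, i1, j0, j1, p0, p1):
    """Update the small block C[i0:i1][j0:j1] += A[i0:i1][p0:p1] * B[p0:p1][j0:j1].

    Written as a sequence of rank-1 updates. In a real library this is the
    only piece that would be hand-tuned per architecture.
    """
    for p in range(p0, p1):
        for i in range(i0, i1):
            a = A[i][p]
            for j in range(j0, j1):
                C[i][j] += a * B[p][j]


def gemm_blocked(A, B, C, mc=4, nc=6, kc=3, mr=2, nr=2):
    """Compute C += A * B on lists of lists with five loops around a micro-kernel.

    Block sizes (mc, nc, kc, mr, nr) are illustrative assumptions only.
    """
    m = len(C)
    n = len(C[0]) if C else 0
    k = len(B)
    # Three outer loops: the loops "around the inner kernel".
    for jc in range(0, n, nc):            # column panels of C and B
        j_end = min(jc + nc, n)
        for pc in range(0, k, kc):        # slices of the shared k dimension
            p_end = min(pc + kc, k)
            for ic in range(0, m, mc):    # row panels of C and A
                i_end = min(ic + mc, m)
                # Two additional loops: the former inner kernel, opened up
                # so that each iteration calls the micro-kernel.
                for jr in range(jc, j_end, nr):
                    for ir in range(ic, i_end, mr):
                        micro_kernel(
                            A, B, C,
                            ir, min(ir + mr, i_end),
                            jr, min(jr + nr, j_end),
                            pc, p_end,
                        )
    return C
```

In this sketch, iterations of the jc, ic, jr, and ir loops write to disjoint blocks of C, which is why each is a candidate for distributing work across threads. Iterations of the pc loop all accumulate into the same entries of C, so parallelizing it would need a reduction or synchronization.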