• DocumentCode
    1783356
  • Title

    Effectively Exploiting Parallel Scale for All Problem Sizes in LU Factorization

  • Author

    Hasan, M.R. ; Whaley, R. Clint

  • Author_Institution
    Center for Comput. & Technol., Louisiana State Univ., Baton Rouge, LA, USA
  • fYear
    2014
  • fDate
    19-23 May 2014
  • Firstpage
    1039
  • Lastpage
    1048
  • Abstract
    LU factorization is one of the most widely used methods for solving linear equations, and thus its performance underlies a broad range of scientific computing. As architectural trends have replaced clock rate improvements with increases in parallel scale, library writers have responded by using tiled algorithms, where operand size is constrained in order to maximize parallelism, as seen in the well-known PLASMA library. This approach has two main drawbacks: (1) asymptotic performance is reduced due to limited operand size, and (2) performance of small- to medium-sized problems is reduced due to unnecessary data motion in the parallel caches. In this paper we introduce a new approach where asymptotic performance is maximized by using special low-overhead kernel primitives that are auto-generated by the ATLAS framework, while unnecessary cache motion is minimized by using explicit cache management. We show that this technique can outperform all known libraries at all problem sizes on commodity parallel Intel and AMD platforms, with asymptotic LU performance of roughly 91% of hardware theoretical peak for a 12-core Intel Xeon, and 87% for a 32-core AMD Opteron.
  • Keywords
    cache storage; matrix decomposition; multiprocessing systems; parallel algorithms; AMD Opteron; ATLAS framework; Intel Xeon; LU factorization; PLASMA library; asymptotic LU performance; asymptotic performance; clock rate; explicit cache management; limited operand size; linear equations; low-overhead kernel primitives; parallel caches; parallel scale; scientific computing; small to medium sized problems; tiled algorithms; unnecessary data motion; Kernel; Libraries; Optimization; Parallel processing; Plasmas; Principal component analysis; Timing; ATLAS; LAPACK; LU factorization; PCA; PLASMA; parallel linear algebra; threaded parallelism
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium, 2014 IEEE 28th International
  • Conference_Location
    Phoenix, AZ
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4799-3799-8
  • Type

    conf

  • DOI
    10.1109/IPDPS.2014.109
  • Filename
    6877333