Title :
Effectively Exploiting Parallel Scale for All Problem Sizes in LU Factorization
Author :
Hasan, M.R.; Whaley, R. Clint
Author_Institution :
Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, USA
Abstract :
LU factorization is one of the most widely used methods for solving linear equations, and thus its performance underlies a broad range of scientific computing. As architectural trends have replaced clock rate improvements with increases in parallel scale, library writers have responded by using tiled algorithms, where operand size is constrained in order to maximize parallelism, as seen in the well-known PLASMA library. This approach has two main drawbacks: (1) asymptotic performance is reduced due to limited operand size, and (2) performance of small- to medium-sized problems is reduced due to unnecessary data motion in the parallel caches. In this paper, we introduce a new approach where asymptotic performance is maximized by using special low-overhead kernel primitives that are auto-generated by the ATLAS framework, while unnecessary cache motion is minimized by using explicit cache management. We show that this technique can outperform all known libraries at all problem sizes on commodity parallel Intel and AMD platforms, with asymptotic LU performance of roughly 91% of hardware theoretical peak for a 12-core Intel Xeon, and 87% for a 32-core AMD Opteron.
Keywords :
cache storage; matrix decomposition; multiprocessing systems; parallel algorithms; AMD Opteron; ATLAS framework; Intel Xeon; LU factorization; PLASMA library; asymptotic LU performance; asymptotic performance; clock rate; explicit cache management; limited operand size; linear equations; low-overhead kernel primitives; parallel caches; parallel scale; scientific computing; small to medium sized problems; tiled algorithms; unnecessary data motion; Kernel; Libraries; Optimization; Parallel processing; Plasmas; Principal component analysis; Timing; ATLAS; LAPACK; LU factorization; PCA; PLASMA; parallel linear algebra; threaded parallelism
Conference_Title :
Parallel and Distributed Processing Symposium, 2014 IEEE 28th International
Conference_Location :
Phoenix, AZ
Print_ISBN :
978-1-4799-3799-8
DOI :
10.1109/IPDPS.2014.109