DocumentCode
1783356
Title
Effectively Exploiting Parallel Scale for All Problem Sizes in LU Factorization
Author
Hasan, M.R. ; Whaley, R. Clint
Author_Institution
Center for Comput. & Technol., Louisiana State Univ., Baton Rouge, LA, USA
fYear
2014
fDate
19-23 May 2014
Firstpage
1039
Lastpage
1048
Abstract
LU factorization is one of the most widely used methods for solving linear equations, and thus its performance underlies a broad range of scientific computing. As architectural trends have replaced clock rate improvements with increases in parallel scale, library writers have responded by using tiled algorithms, where operand size is constrained in order to maximize parallelism, as seen in the well-known PLASMA library. This approach has two main drawbacks: (1) asymptotic performance is reduced due to limited operand size, and (2) performance of small- to medium-sized problems is reduced due to unnecessary data motion in the parallel caches. In this paper, we introduce a new approach where asymptotic performance is maximized by using special low-overhead kernel primitives that are auto-generated by the ATLAS framework, while unnecessary cache motion is minimized by using explicit cache management. We show that this technique can outperform all known libraries at all problem sizes on commodity parallel Intel and AMD platforms, with asymptotic LU performance of roughly 91% of hardware theoretical peak for a 12-core Intel Xeon, and 87% for a 32-core AMD Opteron.
Keywords
cache storage; matrix decomposition; multiprocessing systems; parallel algorithms; AMD Opteron; ATLAS framework; Intel Xeon; LU factorization; PLASMA library; asymptotic LU performance; asymptotic performance; clock rate; explicit cache management; limited operand size; linear equations; low-overhead kernel primitives; parallel caches; parallel scale; scientific computing; small to medium sized problems; tiled algorithms; unnecessary data motion; Kernel; Libraries; Optimization; Parallel processing; Plasmas; Principal component analysis; Timing; ATLAS; LAPACK; LU factorization; PCA; PLASMA; parallel linear algebra; threaded parallelism
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel and Distributed Processing Symposium, 2014 IEEE 28th International
Conference_Location
Phoenix, AZ
ISSN
1530-2075
Print_ISBN
978-1-4799-3799-8
Type
conf
DOI
10.1109/IPDPS.2014.109
Filename
6877333