Effectively Exploiting Parallel Scale for All Problem Sizes in LU Factorization

Author

Hasan, M.R.; Whaley, R. Clint

Author_Institution

Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, USA

fYear

2014

fDate

19-23 May 2014

Firstpage

1039

Lastpage

1048

Abstract

LU factorization is one of the most widely used methods for solving linear equations, and thus its performance underlies a broad range of scientific computing. As architectural trends have replaced clock rate improvements with increases in parallel scale, library writers have responded by using tiled algorithms, where operand size is constrained in order to maximize parallelism, as seen in the well-known PLASMA library. This approach has two main drawbacks: (1) asymptotic performance is reduced due to limited operand size, and (2) performance of small- to medium-sized problems is reduced due to unnecessary data motion in the parallel caches. In this paper, we introduce a new approach where asymptotic performance is maximized by using special low-overhead kernel primitives that are auto-generated by the ATLAS framework, while unnecessary cache motion is minimized by using explicit cache management. We show that this technique can outperform all known libraries at all problem sizes on commodity parallel Intel and AMD platforms, with asymptotic LU performance of roughly 91% of hardware theoretical peak for a 12-core Intel Xeon, and 87% for a 32-core AMD Opteron.
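
For orientation only, the operation the abstract refers to can be sketched as a plain, unblocked, single-threaded LU factorization with partial pivoting. This is not the tiled approach used by PLASMA, nor the ATLAS-generated kernels and explicit cache management the paper introduces; it only illustrates what is being factored and how a "fraction of theoretical peak" figure is typically derived. The function names (`lu_factor`, `lu_solve`, `fraction_of_peak`) are hypothetical, and the 2n^3/3 operation count is the conventional leading-order estimate for LU, assumed here rather than taken from the paper.

```python
def lu_factor(a):
    """Factor a copy of square matrix `a` as P*A = L*U (partial pivoting).

    Returns (lu, piv): lu holds U on and above the diagonal and the
    multipliers of the unit-lower-triangular L below it; piv[k] is the
    row that was swapped with row k at step k.
    """
    n = len(a)
    lu = [list(map(float, row)) for row in a]  # work on a copy
    piv = []
    for k in range(n):
        # Choose the largest-magnitude entry in column k as the pivot.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        piv.append(p)
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]          # multiplier, stored in place of L
            m = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]  # update the trailing submatrix
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve A*x = b using the factors returned by lu_factor."""
    n = len(lu)
    x = list(map(float, b))
    # Apply the recorded row swaps to the right-hand side, in order.
    for k, p in enumerate(piv):
        x[k], x[p] = x[p], x[k]
    # Forward substitution with unit-diagonal L.
    for i in range(n):
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    # Back substitution with U.
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


def fraction_of_peak(n, seconds, peak_flops):
    """Achieved rate divided by peak, assuming ~(2/3)*n^3 operations."""
    flops = (2.0 / 3.0) * n ** 3
    return flops / seconds / peak_flops
```

For example, `lu_solve(*lu_factor(A), b)` solves a small system, and `fraction_of_peak(n, t, peak)` turns a measured time `t` for an `n`-by-`n` factorization into the kind of percentage quoted in the abstract, given a machine peak supplied by the caller.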

Keywords

cache storage; matrix decomposition; multiprocessing systems; parallel algorithms; AMD Opteron; ATLAS framework; Intel Xeon; LU factorization; PLASMA library; asymptotic LU performance; asymptotic performance; clock rate; explicit cache management; limited operand size; linear equations; low-overhead kernel primitives; parallel caches; parallel scale; scientific computing; small to medium sized problems; tiled algorithms; unnecessary data motion; Kernel; Libraries; Optimization; Parallel processing; Plasmas; Principal component analysis; Timing; ATLAS; LAPACK; LU factorization; PCA; PLASMA; parallel linear algebra; threaded parallelism

fLanguage

English

Publisher

IEEE

Conference_Title

Parallel and Distributed Processing Symposium, 2014 IEEE 28th International

Conference_Location

Phoenix, AZ

ISSN

1530-2075

Print_ISBN

978-1-4799-3799-8

Type

conf

DOI

10.1109/IPDPS.2014.109

Filename

6877333