New Scheduling Strategies and Hybrid Programming for a Parallel Right-looking Sparse LU Factorization Algorithm on Multicore Cluster Systems

Author

Yamazaki, Ichitaro ; Li, Xiaoye S.

Author_Institution

Comput. Res. Div., Lawrence Berkeley Nat. Lab., Berkeley, CA, USA

fYear

2012

fDate

21-25 May 2012

Firstpage

619

Lastpage

630

Abstract

Parallel sparse LU factorization is a key computational kernel in the solution of large-scale linear systems of equations. In this paper, we propose two strategies to address scalability issues of a factorization algorithm on modern HPC systems. The first strategy is at the algorithmic level: we schedule independent tasks as soon as possible to reduce the idle time and the critical path of the algorithm. We demonstrate using thousands of cores that our new scheduling strategy reduces the runtime nearly three-fold compared with a state-of-the-art pipelined factorization algorithm. The second strategy is at both the programming and architecture levels: we incorporate light-weight OpenMP threads in each MPI process to reduce both the memory and time overheads of a pure MPI implementation on many-core NUMA architectures. Using this hybrid programming paradigm, we obtain a significant reduction in memory usage while achieving a parallel efficiency competitive with that of a pure MPI paradigm. As a result, where the pure MPI paradigm failed because of the per-core memory constraint, the hybrid paradigm could utilize more cores on each node and reduce the factorization time on the same number of nodes. We present an extensive performance analysis of the new strategies using thousands of cores of two leading HPC systems, a Cray-XE6 and an IBM iDataPlex.

Keywords

application program interfaces; mathematics computing; matrix decomposition; message passing; multiprocessing systems; parallel architectures; scheduling; Cray-XE6; HPC system; IBM iDataPlex; MPI process; algorithmic-level; architecture-level; computational kernel; factorization time reduction; hybrid programming paradigm; independent task scheduling; large-scale linear system; light-weight OpenMP thread; many-core NUMA architecture; memory overhead reduction; memory usage reduction; multicore cluster system; parallel efficiency; parallel right-looking sparse LU factorization algorithm; per-core memory constraint; pipelined factorization algorithm; programming-level; runtime reduction; scalability issue; time overhead reduction; Linear systems; Memory management; Multicore processing; Processor scheduling; Programming; Scheduling

fLanguage

English

Publisher

IEEE

Conference_Titel

Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International

Conference_Location

Shanghai

ISSN

1530-2075

Print_ISBN

978-1-4673-0975-2

Type

conf

DOI

10.1109/IPDPS.2012.63

Filename

6267864