Title :
Mapping Dense LU Factorization on Multicore Supercomputer Nodes
Author :
Lifflander, Jonathan ; Miller, Phil ; Venkataraman, Ramprasad ; Arya, Anshu ; Kale, Laxmikant ; Jones, Terry
Author_Institution :
Univ. of Illinois Urbana-Champaign, Urbana, IL, USA
Abstract :
Dense LU factorization is a prominent benchmark used to rank the performance of supercomputers. Many implementations use block-cyclic distributions of matrix blocks onto a two-dimensional process grid. The process grid dimensions drive a trade-off between communication and computation and are architecture- and implementation-sensitive. The critical panel factorization steps can be made less communication-bound by overlapping asynchronous collectives for pivoting with the computation of rank-k updates. By shifting the computation-communication trade-off, a modified block-cyclic distribution can beneficially exploit more available parallelism on the critical path, and reduce panel factorization's memory hierarchy contention on now-ubiquitous multicore architectures. During active panel factorization, rank-1 updates stream through memory with minimal reuse. In a column-major process grid, the performance of this access pattern degrades as too many streaming processors contend for access to memory. A block-cyclic mapping in the row-major order does not encounter this problem, but consequently sacrifices node and network locality in the critical pivoting steps. We introduce 'striding' to vary between the two extremes of row- and column-major process grids. The maximum available parallelism in the critical path work (active panel factorization, triangular solves, and subsequent broadcasts) is bounded by the length or width of the process grid. Increasing one dimension of the process grid decreases the number of distinct processes and nodes in the other dimension. To increase the harnessed parallelism in both dimensions, we start with a tall process grid. We then apply periodic 'rotation' to this grid to restore exploited parallelism along the row to previous levels. As a test-bed for further mapping experiments, we describe a dense LU implementation that allows a block distribution to be defined as a general function of block to processor. Other mappings can be tested with only small, local changes to the code.
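The row-major versus column-major trade-off described in the abstract can be illustrated with a minimal sketch of a standard block-cyclic mapping written as a function from block to processor. This is not the authors' implementation. The names `block_cyclic_owner` and `panel_nodes`, the parameter `cores_per_node`, and the assumption that consecutive ranks are packed onto the same node are all hypothetical and chosen only for illustration. Striding and rotation are not sketched, since the abstract does not define them precisely.

```python
def block_cyclic_owner(i, j, P, Q, order="row"):
    """Rank owning matrix block (i, j) on a P x Q process grid.

    Standard 2D block-cyclic mapping: block row i goes to grid row i % P,
    block column j goes to grid column j % Q. `order` selects how grid
    coordinates are linearized into a rank (an illustrative assumption).
    """
    p = i % P  # process-grid row
    q = j % Q  # process-grid column
    if order == "row":
        return p * Q + q  # ranks advance along a grid row first
    if order == "column":
        return q * P + p  # ranks advance down a grid column first
    raise ValueError("order must be 'row' or 'column'")


def panel_nodes(P, Q, order, cores_per_node, j=0):
    """Distinct nodes hosting the processes that own block column j.

    Assumes consecutive ranks are packed onto the same node
    (node = rank // cores_per_node), which is a hypothetical placement.
    """
    ranks = {block_cyclic_owner(i, j, P, Q, order) for i in range(P)}
    return sorted({r // cores_per_node for r in ranks})


if __name__ == "__main__":
    # Tall 4 x 2 grid, 4 cores per node (illustrative sizes only).
    for order in ("column", "row"):
        print(order, panel_nodes(4, 2, order, cores_per_node=4))
```

Under these assumptions, the column-major ordering places every process that owns a given panel on one node, which keeps pivoting traffic local but makes all of them stream through the same memory system. The row-major ordering spreads the same panel over more nodes, easing memory contention at the cost of locality.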
Keywords :
matrix decomposition; multiprocessing systems; parallel architectures; parallel machines; block distribution; block-cyclic distributions; block-cyclic mapping; column-major process grid; communication-bound; critical panel factorization steps; dense LU factorization mapping; exploited parallelism; matrix blocks; memory hierarchy contention; multicore supercomputer nodes; now-ubiquitous multicore architectures; overlapping asynchronous collectives; process grid dimensions; rank-k updates; row-major order; two-dimensional process grid; Benchmark testing; Computer architecture; Equations; Libraries; Parallel processing; Program processors; Supercomputers; amd istanbul opteron; bluegene; cache miss; charm++; cray xt; dense lu factorization; hpl; intel nehalem xeon; linpack; mapping; memory hierarchy contention; multicore; parallelism; process grid; scalapack;
Conference_Title :
Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International
Conference_Location :
Shanghai
Print_ISBN :
978-1-4673-0975-2
DOI :
10.1109/IPDPS.2012.61