DocumentCode :
1950522
Title :
Matrix factorization using distributed panels on the Fujitsu AP1000
Author :
Strazdins, Peter
Author_Institution :
Dept. of Comput. Sci., Australian Nat. Univ., Acton, ACT, Australia
Volume :
1
fYear :
1995
fDate :
19-21 Apr 1995
Firstpage :
263
Abstract :
Dense linear algebra computations such as matrix factorization require the technique of 'block-partitioned algorithms' for their efficient implementation on memory-hierarchy processors. For scalar-based distributed memory multiprocessors, the register, cache and off-processor memory levels of the memory hierarchy all affect the optimal block-partition size for such algorithms. Most studies on matrix factorization and similar algorithms have assumed that the block-partition size, or panel width, w of the algorithm is the same as the matrix distribution block size, r, where a rectangular block-cyclic matrix distribution is employed. Here the choice of w=r is essentially determined by the off-processor memory level of the memory hierarchy, with the value of w being a tradeoff between communication startup overhead and load balance considerations. In this paper, we re-examine this assumption in the context of LU and Cholesky factorization of block-cyclic distributed matrices on scalar-based distributed memory multiprocessors, such as the Fujitsu AP1000. Here considerations of the register and cache levels of the hierarchy require a large w. We find that the choice of w, given w=r, leads to a tradeoff between load balance and optimal use of the register and cache levels of the hierarchy (rather than communication startup), and that this tradeoff substantially limits performance. We then briefly describe 'distributed panels' versions of these algorithms, where generally w>r, which effectively diminish this tradeoff to an O(w/N) fraction of the overall computation, where N is the matrix size. Two variants of these versions, one with single rows/columns being communicated and one with single block rows/columns being communicated, are analyzed for their load balance properties.
Results are given for the distributed panels versions of the algorithms on the Fujitsu AP1000, a scalar-based distributed memory multiprocessor; they show significantly superior performance over the w=r versions, with optimum performance achieved for r≈1
Keywords :
linear algebra; mathematics computing; multiprocessing systems; Cholesky factorization; Fujitsu AP1000; LU factorization; block-cyclic distributed matrices; block-partition size; block-partitioned algorithms; communication startup overhead; distributed panels; linear algebra computations; load balance; load balance properties; matrix factorization; memory-hierarchy processors; panel width; scalar-based distributed memory multiprocessors; Algorithm design and analysis; Australia; Brillouin scattering; Computer science; Context; High performance computing; Linear algebra; Partitioning algorithms; Registers; Vector processors;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Algorithms and Architectures for Parallel Processing, 1995. ICAPP 95. IEEE First ICA³PP., IEEE First International Conference on
Conference_Location :
Brisbane, Qld.
Print_ISBN :
0-7803-2018-2
Type :
conf
DOI :
10.1109/ICAPP.1995.472194
Filename :
472194