Author_Institution :
Dept. of Comput. Sci., Australian Nat. Univ., Acton, ACT, Australia
Abstract :
Dense linear algebra computations such as matrix factorization require the technique of 'block-partitioned algorithms' for their efficient implementation on memory-hierarchy processors. For scalar-based distributed memory multiprocessors, the register, cache and off-processor memory levels of the memory hierarchy all affect the optimal block-partition size for such algorithms. Most studies of matrix factorization and similar algorithms have assumed the block-partition size, or panel width, of the algorithm, w, to be the same as the matrix distribution block size, r, when a rectangular block-cyclic matrix distribution is employed. There the choice of w=r is essentially determined by the off-processor memory level of the hierarchy, with the value of w being a tradeoff between communication startup overhead and load balance considerations. In this paper, we re-examine this assumption in the context of LU and Cholesky factorization of block-cyclic distributed matrices on scalar-based distributed memory multiprocessors, such as the Fujitsu AP1000. Here, considerations of the register and cache levels of the hierarchy require a large w. We find that the choice of w, given w=r, leads to a tradeoff between load balance and optimal use of the register and cache levels of the hierarchy (rather than communication startup), and that this tradeoff substantially limits performance. We then briefly describe 'distributed panels' versions of these algorithms, where generally w>r, which effectively reduces this tradeoff to an O(w/N) fraction of the overall computation, where N is the matrix size. Two variants of these versions, one communicating single rows/columns and one communicating single block rows/columns, are analyzed for their load balance properties. Results for the distributed panels versions of the algorithms on the Fujitsu AP1000, a scalar-based distributed memory multiprocessor, are given; they show significantly superior performance over the w=r versions, with optimum performance achieved for r≈1.
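The abstract distinguishes the distribution block size r from the algorithmic panel width w. The sketch below is a hedged illustration, not the paper's code: it shows a 1-D column block-cyclic owner mapping and how a panel of width w>r spans several process columns, which is the idea behind the 'distributed panels' variants. The grid width Q, the helper names, and the loop structure are illustrative assumptions.

```python
# Minimal sketch (assumed, not from the paper): column block-cyclic mapping
# with distribution block size r, and a right-looking panel schedule of width w.

def owner_col(block_col, Q):
    """Process column that owns distribution block column `block_col`
    under a block-cyclic layout over Q process columns."""
    return block_col % Q

def panel_owner_cols(k, w, r, Q, n):
    """Process columns touched by the panel of columns k .. k+w-1.
    For w = r the panel lives on one process column; for w > r
    ('distributed panels') it spans up to ceil(w/r) process columns,
    spreading the panel factorization work more evenly."""
    first_block = k // r
    last_block = (min(k + w, n) - 1) // r
    return sorted({owner_col(bc, Q) for bc in range(first_block, last_block + 1)})

def panel_schedule(n, w):
    """Outer loop of a blocked right-looking factorization: factor a panel
    of w columns, then update the trailing submatrix (update not shown)."""
    for k in range(0, n, w):
        yield k, min(w, n - k)

if __name__ == "__main__":
    n, r, Q = 32, 1, 4        # r ~ 1 is the regime the paper reports as optimal
    for w in (1, 4, 8):       # widen the panel while keeping r fixed
        spread = [len(panel_owner_cols(k, w, r, Q, n))
                  for k, _ in panel_schedule(n, w)]
        print(f"w={w}: a panel touches at most {max(spread)} of {Q} process columns")
```

Running the sketch shows that with r=1 and w>r a single panel is shared by several process columns, whereas w=r confines each panel to one process column, which is the load-balance tradeoff the abstract describes.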
Keywords :
linear algebra; mathematics computing; multiprocessing systems; Cholesky factorization; Fujitsu AP1000; LU factorization; block-cyclic distributed matrices; block-partition size; block-partitioned algorithms; communication startup overhead; distributed panels; linear algebra computations; load balance; load balance properties; matrix factorization; memory-hierarchy processors; panel width; scalar-based distributed memory multiprocessors; Algorithm design and analysis; Australia; Brillouin scattering; Computer science; Context; High performance computing; Linear algebra; Partitioning algorithms; Registers; Vector processors;
Conference_Titel :
Algorithms and Architectures for Parallel Processing, 1995. ICAPP 95. IEEE First ICA³PP, IEEE First International Conference on