Author_Institution :
Dept. of Comput. Sci., Australian Nat. Univ., Acton, ACT, Australia
Abstract :
Dense linear algebra computations such as matrix factorization require the technique of 'block-partitioned algorithms' for their efficient implementation on memory-hierarchy processors. For scalar-based distributed memory multiprocessors, the register, cache and off-processor memory levels of the memory hierarchy all affect the optimal block-partition size for such algorithms. Most studies of matrix factorization and similar algorithms have assumed the block-partition size, or panel width, of the algorithm, w, to be the same as the matrix distribution block size, r, when a rectangular block-cyclic matrix distribution is employed. There the choice of w=r is essentially determined by the off-processor memory level of the hierarchy, with the value of w being a tradeoff between communication startup overhead and load balance considerations. In this paper, we re-examine this assumption in the context of LU and Cholesky factorization of block-cyclic distributed matrices on scalar-based distributed memory multiprocessors, such as the Fujitsu AP1000. Here, considerations of the register and cache levels of the hierarchy require a large w. We find that the choice of w, given w=r, leads to a tradeoff between load balance and optimal use of the register and cache levels of the hierarchy (rather than communication startup), and that this tradeoff substantially limits performance. We then briefly describe 'distributed panels' versions of these algorithms, where generally w>r, which effectively reduces this tradeoff to an O(w/N) fraction of the overall computation, where N is the matrix size. Two variants of these versions, one communicating single rows/columns and one communicating single block rows/columns, are analyzed for their load balance properties. Results for the distributed panels versions of the algorithms on the Fujitsu AP1000, a scalar-based distributed memory multiprocessor, are given; they show significantly superior performance over the w=r versions, with optimum performance achieved for r≈1.
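The abstract distinguishes the distribution block size r from the algorithmic panel width w. The sketch below is a hedged illustration, not the paper's code: it shows a 1-D column block-cyclic owner mapping and how a panel of width w>r spans several process columns, which is the idea behind the 'distributed panels' variants. The grid width Q, the helper names, and the loop structure are illustrative assumptions.

```python
# Minimal sketch (assumed, not from the paper): column block-cyclic mapping
# with distribution block size r, and a right-looking panel schedule of width w.

def owner_col(block_col, Q):
    """Process column that owns distribution block column `block_col`
    under a block-cyclic layout over Q process columns."""
    return block_col % Q

def panel_owner_cols(k, w, r, Q, n):
    """Process columns touched by the panel of columns k .. k+w-1.
    For w = r the panel lives on one process column; for w > r
    ('distributed panels') it spans up to ceil(w/r) process columns,
    spreading the panel factorization work more evenly."""
    first_block = k // r
    last_block = (min(k + w, n) - 1) // r
    return sorted({owner_col(bc, Q) for bc in range(first_block, last_block + 1)})

def panel_schedule(n, w):
    """Outer loop of a blocked right-looking factorization: factor a panel
    of w columns, then update the trailing submatrix (update not shown)."""
    for k in range(0, n, w):
        yield k, min(w, n - k)

if __name__ == "__main__":
    n, r, Q = 32, 1, 4        # r ~ 1 is the regime the paper reports as optimal
    for w in (1, 4, 8):       # widen the panel while keeping r fixed
        spread = [len(panel_owner_cols(k, w, r, Q, n))
                  for k, _ in panel_schedule(n, w)]
        print(f"w={w}: a panel touches at most {max(spread)} of {Q} process columns")
```

Running the sketch shows that with r=1 and w>r a single panel is shared by several process columns, whereas w=r confines each panel to one process column, which is the load-balance tradeoff the abstract describes.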
Keywords :
linear algebra; mathematics computing; multiprocessing systems; Cholesky factorization; Fujitsu AP1000; LU factorization; block-cyclic distributed matrices; block-partition size; block-partitioned algorithms; communication startup overhead; distributed panels; linear algebra computations; load balance; load balance properties; matrix factorization; memory-hierarchy processors; panel width; scalar-based distributed memory multiprocessors; Algorithm design and analysis; Australia; Brillouin scattering; Computer science; Context; High performance computing; Linear algebra; Partitioning algorithms; Registers; Vector processors;
Conference_Titel :
Algorithms and Architectures for Parallel Processing, 1995. ICAPP 95. IEEE First ICA³PP, IEEE First International Conference on