DocumentCode :
746138
Title :
Tiling, block data layout, and memory hierarchy performance
Author :
Park, Neungsoo ; Hong, Bo ; Prasanna, Viktor K.
Author_Institution :
Samsung Electron. Co, Seoul, South Korea
Volume :
14
Issue :
7
fYear :
2003
fDate :
7/1/2003
Firstpage :
640
Lastpage :
654
Abstract :
Recently, several experimental studies have been conducted on block data layout in conjunction with tiling as a data transformation technique to improve cache performance. In this paper, we analyze the cache and translation look-aside buffer (TLB) performance of such alternative layouts (including block data layout and Morton layout) when used in conjunction with tiling. We derive a tight lower bound on TLB performance for standard matrix access patterns and show that block data layout and Morton layout achieve this bound. To improve cache performance, block data layout is used in concert with tiling. Based on the cache and TLB performance analysis, we propose a data block size selection algorithm that finds a tight range for the optimal block size. To validate our analysis, we conducted simulations and experiments using tiled matrix multiplication, LU decomposition, and Cholesky factorization. For matrix multiplication, simulation results using UltraSparc II parameters show that tiling and block data layout, with a block size given by our block size selection algorithm, reduce TLB misses by up to 93 percent compared with other techniques. The total miss cost is reduced considerably. Experiments on several platforms show that tiling with block data layout achieves up to 50 percent performance improvement over other techniques that use conventional layouts. Morton layout is also analyzed and compared with block data layout. Experimental results show that matrix multiplication using block data layout is up to 15 percent faster than matrix multiplication using Morton data layout.
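As a rough illustration of the two layouts named in the abstract, the sketch below maps a matrix element (i, j) to its linear offset. It is not taken from the paper: the function names, the row-major ordering of blocks and of elements inside a block, and the element-level Z-order used for the Morton variant are all assumptions made for this example.

```python
def block_layout_index(i, j, n, b):
    """Offset of element (i, j) of an n x n matrix stored as b x b blocks.

    Assumption: blocks are stored one after another in row-major order,
    and elements inside each block are also row-major.
    """
    if n % b != 0:
        raise ValueError("block size must divide the matrix dimension")
    block_row, in_row = divmod(i, b)
    block_col, in_col = divmod(j, b)
    block_id = block_row * (n // b) + block_col
    return block_id * b * b + in_row * b + in_col


def morton_index(i, j, bits):
    """Offset of element (i, j) in Z-order (Morton order).

    Assumption: the bits of i and j are interleaved, with j in the even
    bit positions and i in the odd ones; `bits` is the width of each index.
    """
    z = 0
    for k in range(bits):
        z |= ((j >> k) & 1) << (2 * k)
        z |= ((i >> k) & 1) << (2 * k + 1)
    return z


# Offsets of the 2 x 2 tile covering rows 0-1, columns 2-3 of a 4 x 4 matrix.
tile = [block_layout_index(i, j, 4, 2) for i in range(2) for j in range(2, 4)]
```

Under these assumptions every element of a tile lands in one contiguous run of offsets, so a tiled loop nest touches fewer distinct pages than it would with a row-major array. That is the intuition behind the TLB results the abstract reports.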
Keywords :
cache storage; matrix multiplication; optimisation; performance evaluation; storage management; Morton data layout; block data layout; cache memory; cache misses; lower bound; matrix multiplication; multilevel memory hierarchy; optimization; tiling; translation look-aside buffer; Analytical models; Costs; Degradation; Delay; Hardware; Helium; Matrix decomposition; Performance analysis; Programming profession; Streaming media;
fLanguage :
English
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
Publisher :
IEEE
ISSN :
1045-9219
Type :
jour
DOI :
10.1109/TPDS.2003.1214317
Filename :
1214317