DocumentCode :
1126595
Title :
An Experimental Study of Self-Optimizing Dense Linear Algebra Software
Author :
Kulkarni, M. ; Pingali, K.
Author_Institution :
Univ. of Texas, Austin
Volume :
96
Issue :
5
fYear :
2008
fDate :
5/1/2008 12:00:00 AM
Firstpage :
832
Lastpage :
848
Abstract :
Memory hierarchy optimizations have been studied by researchers in many areas including compilers, numerical linear algebra, and theoretical computer science. However, the approaches taken by these communities are very different. The compiler community has invested considerable effort in inventing loop transformations like loop permutation and tiling, and in the development of simple analytical models to determine the values of numerical parameters such as tile sizes required by these transformations. Although the performance of compiler-generated code has improved steadily over the years, it is difficult to retarget restructuring compilers to new platforms because of the need to develop analytical models manually for new platforms. The search for performance portability has led to the development of self-optimizing software systems. One approach to self-optimizing software is the generate-and-test approach, which has been used by the dense numerical linear algebra community to produce high-performance BLAS and fast Fourier transform libraries. Another approach to portable memory hierarchy optimization is to use the divide-and-conquer approach to implementing cache-oblivious algorithms. Each step of divide-and-conquer generates problems of smaller size. When the working set of the subproblems fits in some level of the memory hierarchy, that subproblem can be executed without capacity misses at that level. Although all three approaches have been studied extensively, there are few experimental studies that have compared these approaches. How well does the code produced by current self-optimizing systems perform compared to hand-tuned code? Is empirical search essential to the generate-and-test approach or is it possible to use analytical models with platform-specific parameters to reduce the size of the search space? The cache-oblivious approach uses divide-and-conquer to perform approximate blocking; how well does approximate blocking perform compared to precise blocking? This paper addresses such questions for matrix multiplication, which is the most important dense linear algebra kernel.
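The following C sketch (not taken from the paper) illustrates the two blocking styles the abstract contrasts for matrix multiplication: an explicitly tiled kernel, whose tile size T is a platform-specific parameter that must be chosen by a model or by empirical search, and a cache-oblivious recursive kernel that blocks only approximately via divide-and-conquer. The matrix size N, tile size T, base-case cutoff, and function names are illustrative assumptions; matrices are assumed square, row-major, with dimensions that are powers of two.

#include <stddef.h>

#define N 512   /* assumed problem size */
#define T 64    /* assumed tile size: the platform-specific parameter */

/* Explicitly blocked (tiled) C += A * B on N x N row-major matrices.
 * Each T x T tile of A, B, and C is reused while it is cache-resident. */
void matmul_tiled(const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < N; ii += T)
        for (size_t kk = 0; kk < N; kk += T)
            for (size_t jj = 0; jj < N; jj += T)
                for (size_t i = ii; i < ii + T; i++)
                    for (size_t k = kk; k < kk + T; k++)
                        for (size_t j = jj; j < jj + T; j++)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}

/* Cache-oblivious C += A * B: split the n x n problem into quadrants and
 * recurse; at some depth every subproblem's working set fits in each level
 * of the memory hierarchy, so no tile size needs to be chosen. The offsets
 * (ra,ca), (rb,cb), (rc,cc) locate the current blocks of A, B, and C inside
 * the full N x N arrays. */
void matmul_rec(const double *A, const double *B, double *C,
                size_t ra, size_t ca, size_t rb, size_t cb,
                size_t rc, size_t cc, size_t n) {
    if (n <= 16) {                       /* assumed base-case size */
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                for (size_t j = 0; j < n; j++)
                    C[(rc + i) * N + cc + j] +=
                        A[(ra + i) * N + ca + k] * B[(rb + k) * N + cb + j];
        return;
    }
    size_t h = n / 2;
    /* C11 += A11*B11 + A12*B21; C12 += A11*B12 + A12*B22;
     * C21 += A21*B11 + A22*B21; C22 += A21*B12 + A22*B22. */
    matmul_rec(A, B, C, ra,     ca,     rb,     cb,     rc,     cc,     h);
    matmul_rec(A, B, C, ra,     ca + h, rb + h, cb,     rc,     cc,     h);
    matmul_rec(A, B, C, ra,     ca,     rb,     cb + h, rc,     cc + h, h);
    matmul_rec(A, B, C, ra,     ca + h, rb + h, cb + h, rc,     cc + h, h);
    matmul_rec(A, B, C, ra + h, ca,     rb,     cb,     rc + h, cc,     h);
    matmul_rec(A, B, C, ra + h, ca + h, rb + h, cb,     rc + h, cc,     h);
    matmul_rec(A, B, C, ra + h, ca,     rb,     cb + h, rc + h, cc + h, h);
    matmul_rec(A, B, C, ra + h, ca + h, rb + h, cb + h, rc + h, cc + h, h);
}

The tiled version achieves precise blocking for one cache level at the cost of a tuning parameter per platform; the recursive version needs no tuning but blocks only approximately, which is exactly the trade-off the paper's experiments examine.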
Keywords :
cache storage; divide and conquer methods; linear algebra; mathematics computing; matrix multiplication; memory architecture; program compilers; self-adjusting systems; approximate blocking; cache blocking; cache memory; cache-oblivious approach; compiler-generated code; dense linear algebra kernel; divide-and-conquer; generate-and-test approach; matrix multiplication; memory architecture; portable memory hierarchy optimization; self-optimizing dense linear algebra software; self-optimizing software system; Analytical models; Computer science; Fast Fourier transforms; Linear algebra; Optimizing compilers; Software libraries; Software performance; Software systems; Algorithms; cache blocking; cache memories; computer performance; linear algebra; matrix multiplication; memory architecture; tiling;
fLanguage :
English
Journal_Title :
Proceedings of the IEEE
Publisher :
IEEE
ISSN :
0018-9219
Type :
jour
DOI :
10.1109/JPROC.2008.917732
Filename :
4484942