DocumentCode
167298
Title
Autotuning Tensor Transposition
Author
Lai Wei; John Mellor-Crummey
fYear
2014
fDate
19-23 May 2014
Firstpage
342
Lastpage
351
Abstract
Tensor transposition, a generalization of matrix transposition, is an important primitive used when performing tensor contraction. Efficient implementation of tensor transposition for modern node architectures depends on architectural capabilities such as the cache and memory hierarchy, multithreading, and SIMD parallelism. This paper introduces a framework that uses static analysis and empirical autotuning, built on a rule-based code generation and transformation system, to produce optimized parallel tensor transposition code for node architectures. By exploring various optimization techniques with different settings, our framework achieves more than 80% of memcpy bandwidth for tensors on two very different node architectures: a dual-socket system with Intel Westmere processors and a quad-socket system with IBM POWER7 processors.
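(Illustrative sketch, not the paper's generated code: the abstract's primitive, B = permute(A), amounts to an index-permuting copy of a dense tensor. The rank-3 restriction, function name transpose3, and permutation convention below are assumptions for illustration; the paper's framework autotunes blocked, threaded, and SIMD variants of this loop nest.)

/* Naive out-of-place transposition of a row-major rank-3 tensor.
 * perm[d] names the source dimension that becomes output dimension d. */
#include <stddef.h>
#include <stdio.h>

static void transpose3(const double *a, double *b,
                       const size_t dims[3], const int perm[3])
{
    size_t odims[3];
    for (int d = 0; d < 3; ++d)
        odims[d] = dims[perm[d]];           /* output extents */

    for (size_t i0 = 0; i0 < dims[0]; ++i0)
        for (size_t i1 = 0; i1 < dims[1]; ++i1)
            for (size_t i2 = 0; i2 < dims[2]; ++i2) {
                size_t src[3] = { i0, i1, i2 };
                /* linearized destination index under the permutation */
                size_t dst = (src[perm[0]] * odims[1] + src[perm[1]]) * odims[2]
                           + src[perm[2]];
                b[dst] = a[(i0 * dims[1] + i1) * dims[2] + i2];
            }
}

int main(void)
{
    const size_t dims[3] = { 2, 3, 4 };
    const int perm[3] = { 2, 0, 1 };        /* B[k][i][j] = A[i][j][k] */
    double a[24], b[24];
    for (int i = 0; i < 24; ++i) a[i] = i;
    transpose3(a, b, dims, perm);
    printf("b[0..3] = %.0f %.0f %.0f %.0f\n", b[0], b[1], b[2], b[3]);
    return 0;
}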
Keywords
matrix algebra; optimisation; parallel processing; program compilers; program diagnostics; tensors; empirical autotuning; matrix transposition; node architectures; optimization techniques; parallel tensor transposition code; rule-based code generation; static analysis; tensor contraction; Arrays; Bandwidth; Optimization; Prefetching; Tensile stress
fLanguage
English
Publisher
IEEE
Conference_Titel
2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW)
Conference_Location
Phoenix, AZ, USA
Print_ISBN
978-1-4799-4117-9
Type
conf
DOI
10.1109/IPDPSW.2014.43
Filename
6969409
Link To Document