DocumentCode
1796530
Title
Padding free bank conflict resolution for CUDA-based matrix transpose algorithm
Author
Khan, Ajmal ; Al-Mouhamed, Mayez ; Fatayar, A. ; Almousa, A. ; Baqais, A. ; Assayony, M.
Author_Institution
Dept. of Comput. Eng., King Fahd Univ. of Pet. & Miner., Dhahran, Saudi Arabia
fYear
2014
fDate
June 30 2014-July 2 2014
Firstpage
1
Lastpage
6
Abstract
Matrix transposition is an important linear algebra procedure with a deep impact on various computational science and engineering applications. Several factors hinder the expected performance of large matrix transposes on Graphics Processing Units (GPUs). The performance degradation stems from memory access patterns, namely uncoalesced access to global memory and bank conflicts in the shared memory of the streaming multiprocessors within the GPU. In this paper, two matrix transpose algorithms are proposed that address these issues by ensuring coalesced access and conflict-free bank access. The proposed algorithms have execution times comparable to the NVIDIA SDK bank-conflict-free matrix transpose implementation. The main advantage of the proposed algorithms is that they eliminate bank conflicts while allocating shared memory exactly equal to the tile size (T × T) of the problem space. By contrast, to the best of our knowledge, published research requires a padded allocation of T × (T + 1). We have also applied the proposed transpose algorithm to the recursive Gaussian implementation in the NVIDIA SDK and achieved about a 6% improvement in performance.
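The bank arithmetic behind the T × T versus T × (T + 1) comparison can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions: 32 shared-memory banks of 4-byte words, a tile width T = 32 equal to the warp size, and a diagonal (skewed) column index as one well-known padding-free layout. The abstract does not describe the two proposed algorithms, so the skewed layout here should not be read as the authors' method; the names `max_conflict` and `column_read` are hypothetical.

```python
BANKS = 32  # assumption: 32 shared-memory banks, one 4-byte word wide
T = 32      # assumption: tile width equal to the warp size


def max_conflict(addresses):
    """Return the largest number of word addresses that fall in one bank."""
    counts = [0] * BANKS
    for a in addresses:
        counts[a % BANKS] += 1
    return max(counts)


def column_read(stride, col, skew=False):
    """Word addresses touched when one warp reads column `col` of a tile.

    `stride` is the allocated row length in words (T or T + 1).
    With `skew=True`, element (row, col) is stored at column
    (col + row) % T, an illustrative padding-free layout.
    """
    addrs = []
    for row in range(T):
        c = (col + row) % T if skew else col
        addrs.append(row * stride + c)
    return addrs


# Worst case over all columns of the tile for each layout.
unpadded = max(max_conflict(column_read(T, c)) for c in range(T))
padded = max(max_conflict(column_read(T + 1, c)) for c in range(T))
skewed = max(max_conflict(column_read(T, c, skew=True)) for c in range(T))

print("T x T, plain index   :", unpadded, "-way,", T * T, "words")
print("T x (T + 1), padded  :", padded, "-way,", T * (T + 1), "words")
print("T x T, skewed index  :", skewed, "-way,", T * T, "words")
```

Under these assumptions, a column read of the plain T × T tile sends every thread of the warp to the same bank, while both the padded and the skewed layouts spread the 32 accesses over 32 distinct banks. The padded layout costs T extra words per tile; the skewed layout keeps the allocation at exactly T × T.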
Keywords
graphics processing units; mathematics computing; matrix algebra; parallel architectures; shared memory systems; storage allocation; CUDA-based matrix transpose algorithm; GPU; NVIDIA SDK bank conflict-free matrix transpose; coalesced access; computational engineering application; computational science application; conflict free bank access; graphic processing units; linear algebra procedure; matrix transposition; memory access pattern; padding free bank conflict resolution; recursive Gaussian implementation; shared memory allocation; shared streaming multiprocessor memory; Algorithm design and analysis; Graphics processing units; Indexes; Instruction sets; Kernel; Linear algebra; Writing; Bank conflict free; CUDA GPU; coalesced memory access; linear algebra solvers; matrix transpose; solving system of linear equations;
fLanguage
English
Publisher
IEEE
Conference_Titel
Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2014 15th IEEE/ACIS International Conference on
Conference_Location
Las Vegas, NV
Type
conf
DOI
10.1109/SNPD.2014.6888709
Filename
6888709