Padding free bank conflict resolution for CUDA-based matrix transpose algorithm

Author

Khan, Ajmal ; Al-Mouhamed, Mayez ; Fatayar, A. ; Almousa, A. ; Baqais, A. ; Assayony, M.

Author_Institution

Dept. of Comput. Eng., King Fahd Univ. of Pet. & Miner., Dhahran, Saudi Arabia

fYear

2014

fDate

June 30 2014-July 2 2014

Firstpage

1

Lastpage

6

Abstract

Matrix transposition is an important linear algebra procedure with a deep impact on many computational science and engineering applications. Several factors hinder the expected performance of large matrix transposes on Graphics Processing Units (GPUs). The performance degradation stems from memory access patterns, namely uncoalesced access to global memory and bank conflicts in the shared memory of the streaming multiprocessors within the GPU. In this paper, two matrix transpose algorithms are proposed that address both issues by ensuring coalesced global memory access and conflict-free shared memory bank access. The proposed algorithms have execution times comparable to the NVIDIA SDK bank-conflict-free matrix transpose implementation. Their main advantage is that they eliminate bank conflicts while allocating shared memory exactly equal to the tile size (T × T) of the problem space. By contrast, to the best of our knowledge, previously published approaches need a padded allocation of T × (T + 1). We have also applied the proposed transpose algorithm to the recursive Gaussian implementation in the NVIDIA SDK and achieved about a 6% improvement in performance.
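The abstract does not describe the two proposed algorithms, so the following is only a minimal sketch of the problem they address, not the authors' method. It models shared memory as 32 banks of 4-byte words (an assumption that matches common NVIDIA hardware) and counts how many threads of a 32-thread warp hit the same bank when reading one tile column. It compares a plain T × T tile, the padded T × (T + 1) tile, and a skewed (diagonal) index, which is one well-known padding-free layout chosen here purely for illustration. All function names are hypothetical.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 shared memory banks, one 4-byte word each
T = 32          # tile edge, equal to the warp size in this sketch


def max_conflict_degree(addresses):
    """Largest number of distinct word addresses that fall in one bank.

    A result of 1 means the access is conflict-free.
    """
    banks = Counter(addr % NUM_BANKS for addr in set(addresses))
    return max(banks.values())


def column_addresses(col, width):
    """Word addresses touched when a warp reads column `col` of a
    row-major tile whose rows are `width` words long."""
    return [row * width + col for row in range(T)]


def skewed_address(row, col):
    """Illustrative padding-free layout: rotate each row by its row index."""
    return row * T + (col + row) % T


plain_conflict = max_conflict_degree(column_addresses(0, T))       # T x T
padded_conflict = max_conflict_degree(column_addresses(0, T + 1))  # T x (T + 1)
skewed_col_conflict = max_conflict_degree(
    [skewed_address(row, 0) for row in range(T)])
skewed_row_conflict = max_conflict_degree(
    [skewed_address(0, col) for col in range(T)])

print("plain  T x T     column read:", plain_conflict)
print("padded T x (T+1) column read:", padded_conflict)
print("skewed T x T     column read:", skewed_col_conflict)
print("skewed T x T     row read   :", skewed_row_conflict)
print("words allocated: plain/skewed", T * T, "padded", T * (T + 1))
```

In this model the plain tile serializes a column read across all 32 threads, while both the padded and the skewed layouts are conflict-free; the skewed layout does so within T × T words, at the cost of extra index arithmetic on every access.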

Keywords

graphics processing units; mathematics computing; matrix algebra; parallel architectures; shared memory systems; storage allocation; CUDA-based matrix transpose algorithm; GPU; NVIDIA SDK bank conflict-free matrix transpose; coalesced access; computational engineering application; computational science application; conflict free bank access; graphic processing units; linear algebra procedure; matrix transposition; memory access pattern; padding free bank conflict resolution; recursive Gaussian implementation; shared memory allocation; shared streaming multiprocessor memory; Algorithm design and analysis; Graphics processing units; Indexes; Instruction sets; Kernel; Linear algebra; Writing; Bank conflict free; CUDA GPU; coalesced memory access; linear algebra solvers; matrix transpose; solving system of linear equations

fLanguage

English

Publisher

IEEE

Conference_Title

Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2014 15th IEEE/ACIS International Conference on

Conference_Location

Las Vegas, NV

Type

conf

DOI

10.1109/SNPD.2014.6888709

Filename

6888709