Acceleration of dRMSD calculation and efficient usage of GPU caches

Author

Filipovic, Jiri ; Plhak, Jan ; Strelak, David

Author_Institution

Fac. of Inf., Masaryk Univ., Brno, Czech Republic

fYear

2015

fDate

20-24 July 2015

Firstpage

47

Lastpage

54

Abstract

In this paper, we introduce GPU acceleration of the dRMSD algorithm, which is used to compare different structures of a molecule. Compared to a multithreaded CPU implementation, we reach a 13.4× speedup in clustering and a 62.7× speedup in 1:1 dRMSD computation using a mid-end GPU. The dRMSD computation exhibits strong memory locality and is thus compute-bound. Alongside a conservative implementation using shared memory, we implement variants of the algorithm that rely on GPU caches to maintain memory locality. Our cache-based implementation reaches 96.5% and 91.6% of the shared-memory performance on Fermi and Maxwell, respectively. We identify several performance pitfalls related to cache blocking in compute-bound codes and suggest optimization techniques to improve performance.

Keywords

cache storage; graphics processing units; multi-threading; optimisation; pattern clustering; shared memory systems; Fermi; GPU acceleration; GPU caches; Maxwell; cache blocking; clustering speedup; compute-bound codes; dRMSD calculation; memory locality; mid-end GPU; multithreaded CPU implementation; optimization techniques; shared memory; shared memory performance; Bandwidth; Computer architecture; Graphics processing units; Instruction sets; Kernel; Optimization; Registers; GPU; RMSD; cache; code optimization

fLanguage

English

Publisher

IEEE

Conference_Titel

High Performance Computing & Simulation (HPCS), 2015 International Conference on

Conference_Location

Amsterdam

Print_ISBN

978-1-4673-7812-3

Type

conf

DOI

10.1109/HPCSim.2015.7237020

Filename

7237020