Regional Information Center for Science and Technology - Approaches for parallelizing reductions on modern GPUs

DocumentCode :

2515826

Title :

Approaches for parallelizing reductions on modern GPUs

Author :

Huo, Xin ; Ravi, Vignesh T. ; Ma, Wenjing ; Agrawal, Gagan

Author_Institution :

Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA

fYear :

2010

fDate :

19-22 Dec. 2010

Firstpage :

Lastpage :

Abstract :

GPU hardware and software have been evolving rapidly. CUDA versions 1.1 and higher support atomic operations on device memory, and CUDA versions 1.2 and higher support atomic operations on shared memory. This paper focuses on parallelizing applications involving reductions on GPUs. Before support for locking became available, these applications could only be parallelized using full replication, i.e., by creating a copy of the reduction object for each thread. From CUDA 1.1 (1.2) onwards, however, using atomic operations on device (shared) memory is another option, though some effort is still required to support locking on floating-point numbers and coarse-grained locking. Based on the tradeoffs between locking and full replication, we also introduce a hybrid approach, in which a group of threads uses atomic operations to update one copy of the reduction object. Using three data mining algorithms that follow the reduction structure, namely k-means clustering, Principal Component Analysis (PCA), and k-nearest neighbor search (kNN), we evaluate the relative performance of these three approaches. We show how the relative performance of these techniques can vary depending on the application and its parameters. The hybrid approach we introduce clearly outperforms the other approaches in several cases.
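
The three strategies named in the abstract can be illustrated with a minimal CPU-side sketch. This is not the authors' CUDA implementation: it assumes C++ threads and `std::atomic` as stand-ins for GPU threads and CUDA atomic operations, uses a simple histogram as the reduction object, and the function name `reduce_histogram` and its `group_size` parameter are hypothetical. Setting `group_size` to 1 gives full replication, setting it to the thread count gives a single copy updated only through atomics, and values in between give the hybrid scheme.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Histogram reduction in which every `group_size` threads share one copy of
// the reduction object (the bin array) and update it with atomic adds.
//   group_size == 1            -> full replication (one copy per thread)
//   group_size == num_threads  -> one shared copy, atomics only
//   anything in between        -> hybrid
// Assumes non-negative input values, num_bins > 0, and group_size > 0.
std::vector<long> reduce_histogram(const std::vector<int>& data,
                                   std::size_t num_bins,
                                   std::size_t num_threads,
                                   std::size_t group_size) {
    const std::size_t num_copies = (num_threads + group_size - 1) / group_size;

    // One flat array holding all copies of the reduction object.
    std::vector<std::atomic<long>> copies(num_copies * num_bins);
    for (auto& c : copies) c.store(0);

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Threads in the same group point at the same copy.
            std::atomic<long>* mine = &copies[(t / group_size) * num_bins];
            // Strided partition of the input across threads.
            for (std::size_t i = t; i < data.size(); i += num_threads) {
                std::size_t bin = static_cast<std::size_t>(data[i]) % num_bins;
                mine[bin].fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();

    // Final combination step: sum the per-group copies into one result.
    std::vector<long> result(num_bins, 0);
    for (std::size_t c = 0; c < num_copies; ++c)
        for (std::size_t b = 0; b < num_bins; ++b)
            result[b] += copies[c * num_bins + b].load();
    return result;
}
```

Calling `reduce_histogram(data, 4, 8, 2)`, for example, runs eight threads that share four copies of a four-bin histogram. Fewer copies mean less memory and a cheaper final combination but more contention on each atomic update, which is the tradeoff the abstract describes.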

Keywords :

computer graphic equipment; coprocessors; data mining; multi-threading; pattern clustering; principal component analysis; shared memory systems; CUDA version; GPU hardware; GPU software; atomic operation; coarse grained locking; data mining algorithm; device memory; floating point number; full replication; k-nearest neighbor search; parallelizing reduction; principal component analysis; reduction object; shared memory; structure k-mean clustering; Clustering algorithms; Data mining; Graphics processing unit; Instruction sets; Performance evaluation; Principal component analysis; System recovery;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

High Performance Computing (HiPC), 2010 International Conference on

Conference_Location :

Dona Paula

Print_ISBN :

978-1-4244-8518-5

Electronic_ISBN :

978-1-4244-8519-2

Type :

conf

DOI :

10.1109/HIPC.2010.5713189

Filename :

5713189

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2515826