DocumentCode :
2979007
Title :
Optimizing Dynamic Programming on Graphics Processing Units Via Data Reuse and Data Prefetch with Inter-Block Barrier Synchronization
Author :
Chao-Chin Wu ; Kai-Cheng Wei ; Ting-Hong Lin
Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Nat. Changhua Univ. of Educ., Changhua, Taiwan
fYear :
2012
fDate :
17-19 Dec. 2012
Firstpage :
45
Lastpage :
52
Abstract :
Our previous study focused on accelerating an important category of dynamic programming (DP) problems, called nonserial polyadic dynamic programming (NPDP), on a graphics processing unit (GPU). In NPDP applications, the degree of parallelism varies significantly across the stages of computation, making it difficult to fully utilize the compute power of the hundreds of processing cores in a GPU. To address this challenge, we proposed a methodology that adaptively adjusts the thread-level parallelism when mapping an NPDP problem onto the GPU, thus providing sufficient and steady degrees of parallelism across the compute stages. This work aims to further improve the performance of NPDP problems. Subproblems and data are tiled so that small data regions fit into shared memory and the buffered data can be reused for each tile of subproblems, thus reducing the number of global memory accesses. However, we found that invoking the same kernel many times, which is required to enforce data consistency across stages, makes it impossible to reuse the tiled data in shared memory once the kernel is invoked again. Fortunately, the inter-block synchronization technique allows us to invoke the kernel exactly once, with the restriction that the number of blocks cannot exceed the total number of streaming multiprocessors. In addition to data reuse, invoking the kernel only once also enables us to prefetch data into shared memory across inter-block synchronization points, which improves performance more than data reuse does. We realize our approach in a real-world NPDP application: the optimal matrix parenthesization problem. Experimental results demonstrate that invoking a kernel only once cannot guarantee a performance improvement unless we also reuse and prefetch data across barrier synchronization points.
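Note: the varying degree of parallelism mentioned in the abstract can be illustrated with a minimal sequential sketch of the optimal matrix parenthesization recurrence. This is not the authors' GPU implementation; the function name `matrix_chain_order` and the `stage_sizes` bookkeeping are assumptions introduced only for illustration. Each stage handles all chains of one length, and the subproblems within a stage are mutually independent, so their count approximates the parallelism available at that stage.

```python
def matrix_chain_order(dims):
    """Minimum scalar multiplications to multiply a chain of matrices.

    Matrix i has shape dims[i] x dims[i + 1]. Also returns the number of
    independent subproblems in each stage (one stage per chain length).
    """
    n = len(dims) - 1                      # number of matrices in the chain
    cost = [[0] * n for _ in range(n)]     # cost[i][j]: best cost for matrices i..j
    stage_sizes = []                       # hypothetical bookkeeping, for illustration

    for length in range(2, n + 1):         # one stage per chain length
        stage_sizes.append(n - length + 1)
        for i in range(n - length + 1):    # subproblems here do not depend on each other
            j = i + length - 1
            # Nonserial polyadic: cost[i][j] combines pairs of results
            # taken from several earlier stages, one pair per split point k.
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1], stage_sizes
```

For a chain of six matrices the stage sizes shrink from 5 to 1, which is the imbalance that makes it hard to keep all GPU cores busy in the later stages.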
Keywords :
buffer storage; data integrity; dynamic programming; graphics processing units; shared memory systems; synchronisation; GPU processing cores; NPDP applications; NPDP problem; buffered data reuse; data consistency enforcement; data prefetch; global memory access; graphics processing unit; inter-block barrier synchronization; inter-block synchronization point; inter-block synchronization technique; kernel; nonserial polyadic dynamic programming; optimal matrix parenthesization problem; parallelism degree; shared memory; streaming multiprocessors; thread-level parallelism; Graphics processing units; Kernel; Memory management; Prefetching; Synchronization; Tiles; GPU; dynamic programming; optimization; parallel computing; tiling;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on
Conference_Location :
Singapore
ISSN :
1521-9097
Print_ISBN :
978-1-4673-4565-1
Electronic_ISBN :
1521-9097
Type :
conf
DOI :
10.1109/ICPADS.2012.17
Filename :
6413552