DocumentCode :
154135
Title :
Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations
Author :
Kasagi, Akihiko ; Nakano, Kaoru ; Ito, Yu
Author_Institution :
Dept. of Inf. Eng., Hiroshima Univ., Higashi-Hiroshima, Japan
fYear :
2014
fDate :
9-12 Sept. 2014
Firstpage :
251
Lastpage :
260
Abstract :
The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computing on CUDA-enabled GPUs. The summed area table (SAT) of a matrix is a data structure frequently used in computer vision; it can be obtained by computing the column-wise prefix-sums and then the row-wise prefix-sums. The main contribution of this paper is to introduce the asynchronous Hierarchical Memory Machine (asynchronous HMM), which supports asynchronous execution of CUDA blocks, and to show a global-memory-access-optimal parallel algorithm for computing the SAT on the asynchronous HMM. A straightforward algorithm (2R2W SAT algorithm) on the asynchronous HMM, which computes the prefix-sums in every column using one thread each and then computes the prefix-sums in every row, performs 2 read operations and 2 write operations per matrix element. The best previously published algorithm (2R1W SAT algorithm) performs 2 read operations and 1 write operation per element. We present a more efficient algorithm (1R1W SAT algorithm) which performs 1 read operation and 1 write operation per element. Since every element of the matrix must be read at least once and every resulting value must be written, our 1R1W SAT algorithm is optimal in terms of global memory access. We also show a combined algorithm ((1 + r)R1W SAT algorithm) of the 2R1W and 1R1W SAT algorithms that may have better performance. We have implemented several algorithms, including the 2R2W, 2R1W, 1R1W, and (1 + r)R1W SAT algorithms, on a GeForce GTX 780 Ti. The experimental results show that our (1 + r)R1W SAT algorithm runs faster than any other SAT algorithm for large input matrices. It also runs more than 100 times faster than the best SAT algorithm using a single CPU.
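The two-pass construction described in the abstract (column-wise prefix-sums followed by row-wise prefix-sums) can be illustrated with a minimal sequential sketch. This is not the paper's GPU implementation or any of its 2R2W, 2R1W, or 1R1W algorithms; it only shows what a SAT contains. The names `summed_area_table` and `region_sum` are hypothetical, the input is assumed to be a non-empty rectangular list of lists, and the four-lookup rectangle query is the standard use of a SAT rather than something stated in this record.

```python
def summed_area_table(a):
    """Return the SAT of a non-empty rectangular matrix (list of lists)."""
    rows, cols = len(a), len(a[0])
    sat = [row[:] for row in a]  # copy so the input is left unchanged
    # Pass 1: column-wise prefix sums (each column accumulates downward).
    for i in range(1, rows):
        for j in range(cols):
            sat[i][j] += sat[i - 1][j]
    # Pass 2: row-wise prefix sums over the result of pass 1.
    for i in range(rows):
        for j in range(1, cols):
            sat[i][j] += sat[i][j - 1]
    return sat


def region_sum(sat, top, left, bottom, right):
    """Sum of a[top..bottom][left..right] (inclusive) from four SAT lookups."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]  # added back: it was subtracted twice
    return total


matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
sat = summed_area_table(matrix)
```

Each pass in this sketch reads and writes every element once, which mirrors the 2-read, 2-write cost per element that the abstract attributes to the straightforward approach.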
Keywords :
computer vision; data structures; graphics processing units; matrix algebra; parallel algorithms; parallel architectures; storage management; (1 + r)R1W SAT algorithm; 2R2W SAT algorithm; CUDA blocks; CUDA-enabled GPU; asynchronous HMM; asynchronous execution; asynchronous hierarchical memory machine; column-wise prefix-sum; computer vision; data structure; global-memory-access-optimal parallel algorithm; hierarchical memory machine; parallel computing model; row-wise prefix-sum; summed area table; write operation; Computer architecture; Graphics processing units; Hidden Markov models; Instruction sets; Pipelines; Random access memory; Synchronization; CUDA; GPU; image processing; memory machine models; prefix-sums; summed area table;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel Processing (ICPP), 2014 43rd International Conference on
Conference_Location :
Minneapolis, MN
ISSN :
0190-3918
Type :
conf
DOI :
10.1109/ICPP.2014.34
Filename :
6957234