Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations

Author

Kasagi, Akihiko ; Nakano, Koji ; Ito, Yasuaki

Author_Institution

Dept. of Inf. Eng., Hiroshima Univ., Higashi-Hiroshima, Japan

fYear

2014

fDate

9-12 Sept. 2014

Firstpage

251

Lastpage

260

Abstract

The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computing on CUDA-enabled GPUs. The summed area table (SAT) of a matrix is a data structure frequently used in computer vision; it can be obtained by computing the column-wise prefix-sums and then the row-wise prefix-sums. The main contribution of this paper is to introduce the asynchronous Hierarchical Memory Machine (asynchronous HMM), which supports asynchronous execution of CUDA blocks, and to show a global-memory-access-optimal parallel algorithm for computing the SAT on the asynchronous HMM. A straightforward algorithm (2R2W SAT algorithm) on the asynchronous HMM, which computes the prefix-sums in every column using one thread each and then computes the prefix-sums in every row, performs 2 read operations and 2 write operations per element of a matrix. The previously published best algorithm (2R1W SAT algorithm) performs 2 read operations and 1 write operation per element. We present a more efficient algorithm (1R1W SAT algorithm) which performs 1 read operation and 1 write operation per element. Clearly, since every element in a matrix must be read at least once, and all resulting values must be written, our 1R1W SAT algorithm is optimal in terms of global memory access. We also show a combined algorithm ((1 + r)R1W SAT algorithm), built from the 2R1W and 1R1W SAT algorithms, that may have better performance. We have implemented several algorithms, including the 2R2W, 2R1W, 1R1W, and (1 + r)R1W SAT algorithms, on a GeForce GTX 780 Ti. The experimental results show that our (1 + r)R1W SAT algorithm runs faster than any other SAT algorithm for large input matrices. Also, it runs more than 100 times faster than the best SAT algorithm using a single CPU.
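As an illustration of the definition in the abstract, the following is a minimal sequential sketch of the SAT computed as column-wise prefix-sums followed by row-wise prefix-sums. It is not the paper's GPU implementation and does not model the asynchronous HMM or any of the 2R2W, 2R1W, or 1R1W access patterns; the function names `summed_area_table` and `rectangle_sum` are hypothetical and chosen only for this example.

```python
def summed_area_table(matrix):
    """Return the SAT of a rectangular list-of-lists matrix.

    Illustrative sequential sketch only (assumed helper, not from the paper).
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    sat = [list(row) for row in matrix]  # work on a copy of the input

    # Pass 1: prefix-sums down every column.
    for j in range(cols):
        for i in range(1, rows):
            sat[i][j] += sat[i - 1][j]

    # Pass 2: prefix-sums along every row of the pass-1 result.
    for i in range(rows):
        for j in range(1, cols):
            sat[i][j] += sat[i][j - 1]

    return sat


def rectangle_sum(sat, top, left, bottom, right):
    """Sum of the original entries in rows top..bottom and columns left..right (inclusive)."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]
    return total
```

Once the table is built, `rectangle_sum` obtains the sum over any axis-aligned rectangle from at most four table entries, which is the usual reason the SAT is attractive in computer vision.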

Keywords

computer vision; data structures; graphics processing units; matrix algebra; parallel algorithms; parallel architectures; storage management; (1 + r)R1W SAT algorithm; 2R2W SAT algorithm; CUDA blocks; CUDA-enabled GPU; asynchronous HMM; asynchronous execution; asynchronous hierarchical memory machine; column-wise prefix-sum; data structure; global-memory-access-optimal parallel algorithm; hierarchical memory machine; parallel computing model; row-wise prefix-sum; summed area table; write operation; Computer architecture; Graphics processing units; Hidden Markov models; Instruction sets; Pipelines; Random access memory; Synchronization; CUDA; GPU; image processing; memory machine models; prefix-sums

fLanguage

English

Publisher

IEEE

Conference_Titel

Parallel Processing (ICPP), 2014 43rd International Conference on

Conference_Location

Minneapolis, MN

ISSN

0190-3918

Type

conf

DOI

10.1109/ICPP.2014.34

Filename

6957234