Title :
Modestly faster histogram computations on GPUs
Author :
Brown, Shawn ; Snoeyink, Jack
Author_Institution :
UNC Chapel Hill, Chapel Hill, NC, USA
Abstract :
We present TRISH, a 256-bin histogram method for byte data that runs up to 50% faster than previous GPU methods for random data and 2-4× faster for image data. The performance gains come from reducing total cycle counts, which in turn comes from improving 1) thread level parallelism (TLP), 2) instruction level parallelism (ILP), and 3) software vector parallelism (VP). TLP is improved by increasing occupancy from 2 to 3 thread blocks, achieved by compacting “per thread” histograms in shared memory and by using register arrays. ILP is improved by increasing the number of independent instructions via loop unrolling by a factor of k = [1..63] and by batching operations into groups of four. VP is supported by compacting bin counts into four 8-bit quads per 32-bit element and by reducing binning and accumulating instructions by working with 32-bit elements as overlapping 16-bit pairs instead of 4 individual bytes. Note that TRISH is a deterministic algorithm that avoids atomic operations and gives performance that is data independent.
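The packing idea in the abstract can be illustrated with a small CPU-side sketch. This is not the TRISH kernel: it is a sequential Python illustration that assumes "four 8-bit quads per 32-bit element" means four 8-bit counters stored in one 32-bit word, and that accumulation can treat each word as two pairs of 16-bit lanes. The names `sum_packed_words`, `packed_histogram`, and `flush_every` are hypothetical and do not come from the paper.

```python
MASK_PAIRS = 0x00FF00FF  # keeps bytes 0 and 2 of a 32-bit word, each in its own 16-bit lane


def sum_packed_words(words):
    """Sum the four 8-bit counters of each word using two 16-bit-pair accumulators.

    Assumption for this sketch: at most 257 words are summed per call, so each
    16-bit lane (max 255 * 257 = 65535) cannot overflow into its neighbour.
    """
    if len(words) > 257:
        raise ValueError("too many words for 16-bit lanes")
    even = 0  # lanes hold the running sums of counters 0 and 2
    odd = 0   # lanes hold the running sums of counters 1 and 3
    for word in words:
        even += word & MASK_PAIRS         # two additions per word instead of four
        odd += (word >> 8) & MASK_PAIRS
    return [even & 0xFFFF, odd & 0xFFFF, even >> 16, odd >> 16]


def packed_histogram(data, flush_every=255):
    """256-bin byte histogram kept in 64 words of four 8-bit counters each."""
    if not 1 <= flush_every <= 255:
        raise ValueError("an 8-bit counter holds at most 255 increments")
    totals = [0] * 256
    packed = [0] * 64
    pending = 0
    for value in data:
        # Bin b lives in word b // 4, byte lane b % 4.
        packed[value >> 2] += 1 << (8 * (value & 3))
        pending += 1
        if pending == flush_every:
            _flush(packed, totals)
            pending = 0
    _flush(packed, totals)
    return totals


def _flush(packed, totals):
    """Move the packed 8-bit counts into full-width totals and clear them."""
    for w in range(64):
        counts = sum_packed_words([packed[w]])
        for lane in range(4):
            totals[4 * w + lane] += counts[lane]
        packed[w] = 0
```

Flushing after at most 255 increments keeps every 8-bit counter from overflowing regardless of the input, which mirrors the abstract's claim of deterministic, data-independent behavior without atomics; how the actual method schedules its flushes is not stated in this record.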
Keywords :
deterministic algorithms; graphics processing units; shared memory systems; 256-bin histogram method; GPU; ILP; TLP; TRISH; VP; binning reduction; byte data; deterministic algorithm; histogram computations; image data; instruction accumulation; instruction level parallelism; loop unrolling; random data; register arrays; shared memory; software vector parallelism; thread blocks; thread level parallelism; total cycle count reduction; Abstracts; Buildings; Kernel; Registers; Throughput; CUDA; Histogram; Parallel Processing
Conference_Title :
Innovative Parallel Computing (InPar), 2012
Conference_Location :
San Jose, CA
Print_ISBN :
978-1-4673-2632-2
Electronic_ISBN :
978-1-4673-2631-5
DOI :
10.1109/InPar.2012.6339589