Title :
Asynchronous Memory Machine Models with Barrier Synchronization
Author_Institution :
Dept. of Inf. Eng., Hiroshima Univ., Higashi Hiroshima, Japan
Abstract :
The Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs. It has been assumed that warps (i.e., groups of threads) on the DMM and the UMM work synchronously in a round-robin manner. However, warps work asynchronously on actual GPUs, in the sense that they may be randomly (or arbitrarily) dispatched for execution. The first contribution of this paper is to introduce an asynchronous version of the DMM and the UMM, in which warps are arbitrarily dispatched. Instead, we assume that threads can execute the "syncthreads" instruction for barrier synchronization. Since barrier synchronization is costly, the number of barrier synchronization operations performed by a parallel algorithm should be evaluated and minimized. The second contribution of this paper is a parallel algorithm that computes the sum of n numbers in optimal computing time with few barrier synchronization steps. Our parallel algorithm computes the sum of n numbers in O(n/w + l log n) time units and O(log l/w + log log w) barrier synchronization steps using wl threads, both on the asynchronous DMM and on the asynchronous UMM with width w and latency l. We also prove that the computing time is optimal because it matches the theoretical lower bound. Quite surprisingly, the number of barrier synchronization steps and the number of threads are independent of n: even if the input size n is very large, our parallel algorithm computes the sum in an optimal number of time units with a fixed number of syncthreads calls, using a fixed number of threads.
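The abstract's algorithm is stated for the asynchronous DMM/UMM models, but its central primitive is the same barrier that CUDA exposes as __syncthreads(). Purely as an illustrative sketch (not the paper's algorithm or its stated barrier bound; the kernel name block_sum, the constant THREADS, and the host-side finishing pass are all chosen here for illustration), a barrier-synchronized summation in CUDA might look as follows, with one barrier separating each reduction round.

// Illustrative sketch only: a conventional per-block reduction with
// __syncthreads() as the barrier, not the DMM/UMM algorithm of the paper.
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 256  // threads per block (a fixed thread count, chosen for illustration)

__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float buf[THREADS];          // shared memory holds one chunk per block
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    buf[tid] = (idx < n) ? in[idx] : 0.0f;  // contiguous (coalesced) load from global memory
    __syncthreads();                        // barrier: all loads finish before reducing

    // log(THREADS) reduction rounds, one barrier per round
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();                    // barrier between rounds
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];           // one partial sum per block
}

int main() {
    const int n = 1 << 20;
    const int blocks = (n + THREADS - 1) / THREADS;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    block_sum<<<blocks, THREADS>>>(in, out, n);
    cudaDeviceSynchronize();

    float sum = 0.0f;                        // finish the reduction of partial sums on the host
    for (int i = 0; i < blocks; ++i) sum += out[i];
    printf("sum = %f (expected %d)\n", sum, n);

    cudaFree(in);
    cudaFree(out);
    return 0;
}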
Keywords :
computational complexity; graphics processing units; parallel algorithms; shared memory systems; synchronisation; GPU; asynchronous DMM; asynchronous UMM; asynchronous memory machine model; barrier synchronization operation; discrete memory machine; global memory; graphical processing unit; optimal computing time; parallel algorithm; parallel computing model; round-robin manner; shared memory; syncthread instruction; unified memory machine; Computational modeling; Graphics processing units; Instruction sets; Parallel algorithms; Random access memory; Synchronization; CUDA; GPU; asynchronous models; contiguous memory access; parallel algorithms; parallel computing models;
Conference_Titel :
Networking and Computing (ICNC), 2012 Third International Conference on
Conference_Location :
Okinawa
Print_ISBN :
978-1-4673-4624-5
DOI :
10.1109/ICNC.2012.18