  • DocumentCode
    1755766
  • Title
    Performance Modeling of Atomic Additions on GPU Scratchpad Memory
  • Author
    Gomez-Luna, Juan ; Gonzalez-Linares, Jose Maria ; Benavides Benitez, Jose Ignacio ; Guil Mata, N.
  • Author_Institution
    Dept. of Comput. Archit. & Electron., Univ. of Cordoba, Cordoba, Spain
  • Volume
    24
  • Issue
    11
  • fYear
    2013
  • fDate
    Nov. 2013
  • Firstpage
    2273
  • Lastpage
    2282
  • Abstract
    GPU applications that use scatter approaches suffer write contention from atomic updates whenever an output element results from more than one input element. Colliding threads are serialized, which seriously harms performance. Dealing with these issues requires a proper understanding of how the scratchpad (shared) memory behaves under conflicting accesses from concurrent threads. This paper therefore presents an exhaustive microbenchmark-based analysis of atomic additions in shared memory that quantifies the impact of access conflicts on latency and throughput. This analysis led us to discover the lock mechanism that enables atomic updates to shared memory and to propose a performance model that estimates the latency penalties due to collisions caused by position or bank conflicts. From this model we derived experiments that show how to optimize applications that use atomic operations. Position conflicts can be reduced by replication and bank conflicts by padding. The benefits of these techniques are illustrated by optimizing two widely used voting processes: the centroid updating step in k-means clustering and histogram calculation.
  • Keywords
    concurrency control; graphics processing units; pattern clustering; shared memory systems; storage management; GPU application; GPU scratchpad memory; access conflict; application optimization; atomic addition; atomic operation; atomic update; bank conflict; centroid updating step; concurrent thread; conflicting access; exhaustive microbenchmark-based analysis; histogram calculation; input element; k-means clustering; latency penalty; lock mechanism; output elements; performance modeling; position conflict; scatter approach; scratchpad memory behavior; shared memory behavior; thread collision; voting process; write contention; Atomic clocks; Atomic measurements; Graphics processing units; Instruction sets; Message systems; Throughput; CUDA; GPU; K-means; Performance model; atomic operations; histogram; shared memory;
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type
    jour
  • DOI
    10.1109/TPDS.2012.319
  • Filename
    6378364