• DocumentCode
    3204689
  • Title
    Automated Architecture-Aware Mapping of Streaming Applications Onto GPUs
  • Author
    Hagiescu, Andrei ; Huynh, Huynh Phung ; Wong, Weng-Fai ; Goh, Rick Siow Mong
  • Author_Institution
    Sch. of Comput., Nat. Univ. of Singapore, Singapore, Singapore
  • fYear
    2011
  • fDate
    16-20 May 2011
  • Firstpage
    467
  • Lastpage
    478
  • Abstract
    Graphics Processing Units (GPUs) are made up of many streaming multiprocessors, each consisting of processing cores that interleave the execution of a large number of threads. Groups of threads - called warps and wavefronts, respectively, in nVidia and AMD literature - are selected by the hardware scheduler and executed in lockstep on the available cores. If threads in such a group access the slow off-chip global memory, the entire group has to be stalled, and another group is scheduled instead. The utilization of a given multiprocessor will remain high if there is a sufficient number of alternative thread groups to select from. Many parallel general-purpose applications have been efficiently mapped to GPUs. Unfortunately, many stream processing applications exhibit unfavorable data movement patterns and a low computation-to-communication ratio that may lead to poor performance. In this paper, we describe an automated compilation flow that maps most stream processing applications onto GPUs by taking into consideration two important architectural features of nVidia GPUs, namely interleaved execution and the small amount of shared memory available in each streaming multiprocessor. In particular, we show that by using a small number of compute threads, so that the memory footprint is reduced, we can achieve high utilization of the GPU cores. Our scheme goes against the conventional wisdom of GPU programming, which is to use a large number of homogeneous threads. Instead, it uses a mix of compute and memory access threads, together with a carefully crafted schedule that exploits parallelism in the streaming application while maximizing the effectiveness of the unique memory hierarchy. We have implemented our scheme in the compiler of the StreamIt programming language, and our results show a significant speedup compared to state-of-the-art solutions.
  • Keywords
    coprocessors; multiprocessing systems; AMD; automated architecture-aware mapping; graphic processing units; hardware scheduler; interleaved execution; nVidia GPUs; processing cores; streaming multiprocessors; warps; wave fronts; Graphics processing unit; Hardware; Instruction sets; Kernel; Memory management; Schedules; Steady-state;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International
  • Conference_Location
    Anchorage, AK
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-61284-372-8
  • Electronic_ISBN
    1530-2075
  • Type
    conf
  • DOI
    10.1109/IPDPS.2011.52
  • Filename
    6012816