• DocumentCode
    1954863
  • Title
    Policy-based tuning for performance portability and library co-optimization
  • Author
    Merrill, Duane ; Garland, Michael ; Grimshaw, Andrew
  • Author_Institution
    NVIDIA Corp., Santa Clara, CA, USA
  • fYear
    2012
  • fDate
    13-14 May 2012
  • Firstpage
    1
  • Lastpage
    10
  • Abstract
    Although modular programming is a fundamental software development practice, software reuse within contemporary GPU kernels is uncommon. For GPU software assets to be reusable across problem instances, they must be inherently flexible and tunable. To illustrate, we survey the performance-portability landscape for a suite of common GPU primitives, evaluating thousands of reasonable program variants across a large diversity of problem instances (microarchitecture, problem size, and data type). While individual specializations provide excellent performance for specific instances, we find no variants with “universally reasonable” performance. In this paper, we present a policy-based design idiom for constructing reusable, tunable software components that can be co-optimized with the enclosing kernel for the specific problem and processor at hand. In particular, this approach enables flexible granularity coarsening which allows the expensive aspects of communication and the redundant aspects of data parallelism to scale with the width of the processor rather than the problem size. From a small library of tunable device subroutines, we have constructed the fastest, most versatile GPU primitives for reduction, prefix and segmented scan, duplicate removal, reduction-by-key, sorting, and sparse graph traversal.
  • Keywords
    graphics processing units; parallel processing; performance evaluation; software reusability; contemporary GPU kernels; data parallelism; data type; duplicate removal; flexible granularity coarsening; library; library co-optimization; microarchitecture; modular programming; performance portability; policy-based design; policy-based tuning; prefix scan; problem size; reduction-by-key; segmented scan; software development practice; sorting; sparse graph traversal; tunable software components reusability; Graphics processing unit; Instruction sets; Kernel; Parallel processing; Registers; Tiles; Tuning; Performance; auto tuning; library design; metaprogramming; performance portability; policy; software reuse;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Innovative Parallel Computing (InPar), 2012
  • Conference_Location
    San Jose, CA
  • Print_ISBN
    978-1-4673-2632-2
  • Electronic_ISBN
    978-1-4673-2631-5
  • Type
    conf
  • DOI
    10.1109/InPar.2012.6339597
  • Filename
    6339597
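
The policy-based idiom summarized in the abstract can be illustrated with a minimal host-side C++ sketch. It is not the authors' library: `ReducePolicy`, `tile_reduce`, and `reduce` are hypothetical names, and GPU threads are simulated with ordinary loops. It assumes a compile-time policy carrying two tuning parameters (threads per tile and items per thread) and shows granularity coarsening: each simulated thread reduces several items serially, so the cooperative combine step runs once per tile rather than once per input item.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tuning policy. The names and fields are illustrative only
// and are not taken from the paper's library.
template <int THREADS_, int ITEMS_PER_THREAD_>
struct ReducePolicy {
    static_assert(THREADS_ > 0 && ITEMS_PER_THREAD_ > 0,
                  "policy parameters must be positive");
    static constexpr int THREADS = THREADS_;
    static constexpr int ITEMS_PER_THREAD = ITEMS_PER_THREAD_;
    static constexpr int TILE_ITEMS = THREADS * ITEMS_PER_THREAD;
};

// Stand-in for a reusable device subroutine, specialized by the policy at
// compile time. Each simulated thread serially sums ITEMS_PER_THREAD inputs
// (the coarsened part), then the THREADS partial sums are combined once.
template <typename Policy, typename T>
T tile_reduce(const std::vector<T>& in, std::size_t tile_offset) {
    assert(tile_offset <= in.size());
    T partials[Policy::THREADS] = {};
    for (int t = 0; t < Policy::THREADS; ++t) {
        for (int i = 0; i < Policy::ITEMS_PER_THREAD; ++i) {
            std::size_t idx =
                tile_offset + std::size_t(t) * Policy::ITEMS_PER_THREAD + i;
            if (idx < in.size()) partials[t] += in[idx];  // guard the last tile
        }
    }
    T total = T();
    for (int t = 0; t < Policy::THREADS; ++t) total += partials[t];
    return total;
}

// Enclosing "kernel": walks the input one tile at a time. Swapping the
// policy changes the tile shape without touching this code.
template <typename Policy, typename T>
T reduce(const std::vector<T>& in) {
    T total = T();
    for (std::size_t off = 0; off < in.size(); off += Policy::TILE_ITEMS)
        total += tile_reduce<Policy>(in, off);
    return total;
}
```

Calling `reduce<ReducePolicy<32, 1>>(v)` and `reduce<ReducePolicy<128, 8>>(v)` on the same vector returns the same sum. Only the tile shape differs, which is the kind of parameter a tuning search would vary per processor and data type.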