Title :
Improving GPGPU resource utilization through alternative thread block scheduling
Author :
Minseok Lee ; Seokwoo Song ; Joosik Moon ; Kim, Jung-Ho ; Woong Seo ; Yeongon Cho ; Soojung Ryu
Author_Institution :
KAIST, Daejeon, South Korea
Abstract :
High performance in GPGPU workloads is obtained by maximizing parallelism and fully utilizing the available resources. Thousands of threads are assigned to each core in units of CTAs (cooperative thread arrays), or thread blocks, with each thread block consisting of multiple warps or wavefronts. The scheduling of these threads can have a significant impact on overall performance. In this work, we explore alternative thread block (CTA) scheduling; in particular, we exploit the interaction between the thread block scheduler and the warp scheduler to improve performance. We explore two aspects of thread block scheduling: (1) LCS (lazy CTA scheduling), which restricts the maximum number of thread blocks allocated to each core, and (2) BCS (block CTA scheduling), where consecutive thread blocks are assigned to the same core. For LCS, we leverage a greedy warp scheduler to help determine the optimal number of thread blocks by measuring only the number of instructions issued, while for BCS, we propose an alternative warp scheduler that is aware of the “block” of CTAs allocated to a core. Building on LCS and the observation that the maximum number of CTAs does not necessarily maximize performance, we also propose mixed concurrent kernel execution, which enables multiple kernels to be allocated to the same core to maximize resource utilization and improve overall performance.
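The two assignment policies named in the abstract can be illustrated with a small static model. This is only a sketch under stated assumptions, not the paper's mechanism: the round-robin baseline, the block size of 2, the fixed per-core cap, and all function names are hypothetical, and real hardware assigns CTAs dynamically as earlier ones finish.

```python
def round_robin_assignment(num_ctas, num_cores):
    """Assumed baseline: CTA i is placed on core i % num_cores."""
    cores = [[] for _ in range(num_cores)]
    for cta in range(num_ctas):
        cores[cta % num_cores].append(cta)
    return cores


def block_assignment(num_ctas, num_cores, block_size=2):
    """BCS-style placement: consecutive CTAs form a block of
    block_size (an assumed value), and each block goes to one core."""
    cores = [[] for _ in range(num_cores)]
    for cta in range(num_ctas):
        block = cta // block_size
        cores[block % num_cores].append(cta)
    return cores


def cap_ctas(assigned, max_ctas_per_core):
    """LCS-style cap: only the first max_ctas_per_core CTAs on each
    core are resident; the rest wait. How the cap is chosen (the paper
    uses issued-instruction counts) is not modeled here."""
    resident = [ctas[:max_ctas_per_core] for ctas in assigned]
    pending = [ctas[max_ctas_per_core:] for ctas in assigned]
    return resident, pending


if __name__ == "__main__":
    print(round_robin_assignment(8, 2))   # consecutive CTAs land on different cores
    bcs = block_assignment(8, 2)
    print(bcs)                            # consecutive pairs share a core
    print(cap_ctas(bcs, 2))               # at most 2 resident CTAs per core
```

With 8 CTAs and 2 cores, the block policy keeps CTAs 0 and 1 on the same core, whereas the assumed round-robin baseline separates them; the cap then limits how many of those CTAs are resident at once.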
Keywords :
concurrency control; graphics processing units; multi-threading; resource allocation; scheduling; BCS; GPGPU resource utilization improvement; GPGPU workloads; LCS; block CTA scheduling; cooperative thread arrays; greedy warp scheduler; high-performance computing; lazy CTA scheduling; mixed concurrent kernel execution; multiple wavefronts; optimal thread blocks; parallelism maximization; performance improvement; resource utilization maximization; thread block scheduler; thread-block scheduling; Graphics processing units; Hardware; Instruction sets; Kernel; Memory management; Resource management; Scheduling;
Conference_Titel :
High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on
Conference_Location :
Orlando, FL
DOI :
10.1109/HPCA.2014.6835937