Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling

Author

Jianlong Zhong ; Bingsheng He

Author_Institution

Sch. of Comput. Eng., Nanyang Technol. Univ., Singapore, Singapore

Volume

25

Issue

6

fYear

2014

fDate

Jun-14

Firstpage

1522

Lastpage

1532

Abstract

Graphics processors, or GPUs, have recently been widely used as accelerators in shared environments such as clusters and clouds. In such shared environments, many kernels are submitted to GPUs from different users, and throughput is an important metric for performance and total ownership cost. Despite recently improved runtime support for concurrent GPU kernel executions, the GPU can be severely underutilized, resulting in suboptimal throughput. In this paper, we propose Kernelet, a runtime system to improve the throughput of concurrent kernel executions on the GPU. Kernelet embraces transparent memory management and PCI-e data transfer techniques, and dynamic slicing and scheduling techniques for kernel executions. With slicing, Kernelet divides a GPU kernel into multiple sub-kernels (namely slices). Each slice has tunable occupancy to allow co-scheduling with other slices for high GPU utilization. We develop a novel Markov chain-based performance model to guide the scheduling decision. Our experimental results demonstrate up to 31 percent and 23 percent performance improvement on NVIDIA Tesla C2050 and GTX680 GPUs, respectively.

Keywords

Markov processes; concurrency control; graphics processing units; operating system kernels; performance evaluation; processor scheduling; program slicing; storage management; GTX680 GPUs; Kernelet; Markov chain-based performance model; NVIDIA Tesla C2050; PCI-e data transfer techniques; concurrent GPU kernel executions; dynamic scheduling techniques; dynamic slicing techniques; graphics processors; high-throughput GPU kernel executions; runtime support; runtime system; shared environments; suboptimal throughput; total ownership cost; transparent memory management; Graphics processing units; Instruction sets; Kernel; Memory management; Optimal scheduling; Runtime; Throughput; GPGPU; Kernel slicing; Markov chain; performance modeling; task scheduling;

fLanguage

English

Journal_Title

Parallel and Distributed Systems, IEEE Transactions on

Publisher

IEEE

ISSN

1045-9219

Type

jour

DOI

10.1109/TPDS.2013.257

Filename

6624111