DocumentCode
43519
Title
Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling
Author
Jianlong Zhong ; Bingsheng He
Author_Institution
School of Computer Engineering, Nanyang Technological University, Singapore, Singapore
Volume
25
Issue
6
fYear
2014
fDate
June 2014
Firstpage
1522
Lastpage
1532
Abstract
Graphics processors, or GPUs, have recently been widely used as accelerators in shared environments such as clusters and clouds. In such shared environments, many kernels are submitted to GPUs from different users, and throughput is an important metric for performance and total ownership cost. Despite recently improved runtime support for concurrent GPU kernel executions, the GPU can be severely underutilized, resulting in suboptimal throughput. In this paper, we propose Kernelet, a runtime system to improve the throughput of concurrent kernel executions on the GPU. Kernelet embraces transparent memory management and PCI-e data transfer techniques, and dynamic slicing and scheduling techniques for kernel executions. With slicing, Kernelet divides a GPU kernel into multiple sub-kernels (namely slices). Each slice has tunable occupancy to allow co-scheduling with other slices for high GPU utilization. We develop a novel Markov chain-based performance model to guide the scheduling decision. Our experimental results demonstrate up to 31 percent and 23 percent performance improvement on NVIDIA Tesla C2050 and GTX680 GPUs, respectively.
Keywords
Markov processes; concurrency control; graphics processing units; operating system kernels; performance evaluation; processor scheduling; program slicing; storage management; GTX680 GPUs; Kernelet; Markov chain-based performance model; NVIDIA Tesla C2050; PCI-e data transfer techniques; concurrent GPU kernel executions; dynamic scheduling techniques; dynamic slicing techniques; graphics processors; high-throughput GPU kernel executions; runtime support; runtime system; shared environments; suboptimal throughput; total ownership cost; transparent memory management; Graphics processing units; Instruction sets; Kernel; Memory management; Optimal scheduling; Runtime; Throughput; GPGPU; Kernel slicing; Markov chain; performance modeling; task scheduling;
fLanguage
English
Journal_Title
Parallel and Distributed Systems, IEEE Transactions on
Publisher
IEEE
ISSN
1045-9219
Type
jour
DOI
10.1109/TPDS.2013.257
Filename
6624111