Abstract:
Originally designed as dedicated coprocessors, GPUs have progressively become part of shared computing environments, such as HPC servers and clusters. Commonly used GPU software stacks (e.g., CUDA and OpenCL), however, are designed for the dedicated use of a GPU by a single application, which can lead to resource underutilization when multiple applications share GPU resources. In recent years, several node-level runtime components have been proposed to address this problem and allow efficient sharing of GPUs among concurrent applications. The concurrency enabled by these systems, however, is limited by synchronizations embedded in the applications or implicitly introduced by the GPU software stack. This work targets this problem. We first analyze the effect of explicit and implicit synchronizations on application concurrency and GPU utilization. We then design runtime mechanisms to bypass these synchronizations, along with a memory management scheme that can be integrated with these synchronization-avoidance mechanisms to improve GPU utilization and system throughput. We integrate these mechanisms into a recently proposed GPU virtualization runtime, Sync-Free GPU (SF-GPU), thereby removing unnecessary blockages caused by multitenancy and ensuring that any two applications running on the same device experience little to no interference, thus maximizing the level of concurrency supported. We also release our mechanisms as a software API that programmers can use to improve the performance of their applications without modifying their code. Finally, we evaluate the impact of our proposed mechanisms on applications run both in isolation and concurrently.
Keywords: Graphics processing units, Runtime, Kernel, Context, Concurrent computing, Hardware