Title :
Scheduling Methods for Accelerating Applications on Architectures with Heterogeneous Cores
Author :
Linchuan Chen ; Xin Huo ; Gagan Agrawal
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
Abstract :
Intra-node architectures for high-performance machines have been evolving rapidly in recent years. We are seeing a diverse set of architectures, most of them with heterogeneous cores. This leads to two important questions for HPC programming: 1) how do we accelerate a single application using a heterogeneous collection of cores? and 2) how do we develop applications and runtime systems that maintain portability and performance portability across a diverse set of architectures? By examining the characteristics of these architectures, we formulate a scheduling problem over non-coherent, non-uniform-access shared memory. We have developed several distinct locking-free dynamic scheduling methods for such settings, including a master-worker method, a core-level token passing method, and a device-level token passing method. These methods have been implemented on two different architectures: an AMD Fusion APU, and a decoupled CPU-GPU node comprising a multi-core CPU and two NVIDIA GPUs; thus, we provide portability for applications across different types of heterogeneous systems. Using six different applications, we compare the scheduling models against multi-core CPU-only and many-core GPU-only versions. The best CPU+1GPU version achieves a speedup of 1.11 to 1.88 over the better of the single-device versions. The CPU+2GPU execution further improves performance by 1.25× to 1.79× over the CPU+1GPU versions. Compared against the scheduling methods implemented in StarPU and OmpSs, two recent systems for CPU-GPU scheduling, we see speedups of 1.08× to 1.21× over the faster of the StarPU and OmpSs executions for each application.
Keywords :
graphics processing units; multiprocessing systems; parallel processing; processor scheduling; shared memory systems; AMD fusion APU; CPU-GPU scheduling; HPC programming; NVIDIA GPU; OmpSs; StarPU; accelerating applications; core-level token passing method; decoupled CPU-GPU node; device-level token passing method; heterogeneous collection; heterogeneous cores; high performance machines; intra-node architectures; locking-free dynamic scheduling methods; many-core GPU-only versions; master-worker method; multicore CPU-only versions; nonuniform access shared memory; performance portability; runtime systems; scheduling models; scheduling problem; Central Processing Unit; Computer architecture; Dynamic scheduling; Graphics processing units; Instruction sets; Kernel; Programming; Heterogeneous CPU-GPU Architectures; Locking-free; Scheduling;
Conference_Titel :
2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW)
Conference_Location :
Phoenix, AZ
Print_ISBN :
978-1-4799-4117-9
DOI :
10.1109/IPDPSW.2014.11