Title :
Scheduling Methods for Accelerating Applications on Architectures with Heterogeneous Cores
Author :
Linchuan Chen ; Xin Huo ; Gagan Agrawal
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
Abstract :
Intra-node architectures for high-performance machines have been evolving rapidly in recent years. We are seeing a diverse set of architectures, most of them with heterogeneous cores. This leads to two important questions for HPC programming: 1) how do we accelerate a single application using a heterogeneous collection of cores? and 2) how do we develop applications and runtime systems that maintain portability and performance portability across a diverse set of architectures? By examining the characteristics of these architectures, we formulate a scheduling problem over non-coherent, non-uniform-access shared memory. We have developed several distinct locking-free dynamic scheduling methods for such settings, including a master-worker method, a core-level token passing method, and a device-level token passing method. These methods have been implemented on two different architectures: an AMD Fusion APU, and a decoupled CPU-GPU node comprising a multi-core CPU and two NVIDIA GPUs; thus, we provide portability for applications across different types of heterogeneous systems. Using six different applications, we compare the scheduling models against multi-core CPU-only and many-core GPU-only versions. The best CPU+1GPU version achieves a speedup of 1.11 to 1.88 over the better of the single-device versions. The CPU+2GPU execution further improves performance by 1.25× to 1.79× over the CPU+1GPU versions. Compared against the scheduling methods implemented in StarPU and OmpSs, two recent systems for CPU-GPU scheduling, we see speedups of 1.08× to 1.21× over the faster of the StarPU and OmpSs executions for each application.
Keywords :
graphics processing units; multiprocessing systems; parallel processing; processor scheduling; shared memory systems; AMD fusion APU; CPU-GPU scheduling; HPC programming; NVIDIA GPU; OmpSs; StarPU; accelerating applications; core-level token passing method; decoupled CPU-GPU node; device-level token passing method; heterogeneous collection; heterogeneous cores; high performance machines; intra-node architectures; locking-free dynamic scheduling methods; many-core GPU-only versions; master-worker method; multicore CPU-only versions; nonuniform access shared memory; performance portability; runtime systems; scheduling models; scheduling problem; Central Processing Unit; Computer architecture; Dynamic scheduling; Graphics processing units; Instruction sets; Kernel; Programming; Heterogeneous CPU-GPU Architectures; Locking-free; Scheduling;
Conference_Titel :
2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW)
Conference_Location :
Phoenix, AZ
Print_ISBN :
978-1-4799-4117-9
DOI :
10.1109/IPDPSW.2014.11