Regional Information Center for Science and Technology - Towards realizing the potential of malleable jobs

DocumentCode :

3591187

Title :

Towards realizing the potential of malleable jobs

Author :

Gupta, Abhishek ; Acun, Bilge ; Sarood, Osman ; Kale, Laxmikant V.

Author_Institution :

Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA

fYear :

2014

Firstpage :

Lastpage :

Abstract :

Malleable jobs are jobs that can dynamically shrink or expand the number of processors on which they execute at runtime in response to an external command. Compared to traditional jobs, malleable jobs can significantly improve system utilization and reduce average response time. To realize these benefits, three components are critical: an adaptive job scheduler, an adaptive resource manager, and an adaptive parallel runtime system. In this paper, we present a novel mechanism for enabling shrink/expand capability in the parallel runtime system using task migration and dynamic load balancing, checkpoint-restart, and Linux shared memory. Our technique performs true shrink/expand, eliminating the need for any residual processes; it requires little application programmer effort and is fast. Further, we establish a bidirectional communication channel between the resource manager and the parallel runtime, and present an asynchronous split-phase mechanism for executing adaptive scheduling decisions. Performance results using Charm++ on the Stampede supercomputer show the efficacy, scalability, and benefits of our approach. Shrinking from 2k to 1k cores takes 16 s, while expanding from 1k to 2k cores takes 40 s. We also demonstrate the utility of our runtime in traditional as well as emerging scenarios, e.g., proactive fault tolerance and clouds.
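
The shrink step named in the abstract (moving tasks off the processors being released, then rebalancing load across the survivors) can be illustrated with a toy sketch. This is not the paper's mechanism or the Charm++ API: the function name `shrink`, the dictionary-based task model, and the greedy heaviest-first placement are all assumptions chosen for illustration, and checkpoint-restart and shared memory are not modeled.

```python
import heapq


def shrink(assignment, loads, new_count):
    """Toy model of a shrink: move tasks off released processors.

    assignment: dict mapping task name -> processor index (hypothetical model)
    loads:      dict mapping task name -> load estimate
    new_count:  number of processors kept (indices 0 .. new_count - 1)

    Returns a new assignment in which every task sits on a surviving processor.
    """
    # Accumulate the load already present on each surviving processor and
    # collect the tasks that live on processors about to be released.
    proc_load = [0.0] * new_count
    evicted = []
    for task, proc in assignment.items():
        if proc < new_count:
            proc_load[proc] += loads[task]
        else:
            evicted.append(task)

    # Min-heap of (current load, processor index) over the survivors.
    heap = [(load, proc) for proc, load in enumerate(proc_load)]
    heapq.heapify(heap)

    new_assignment = dict(assignment)
    # Greedy placement: heaviest evicted task goes to the least-loaded survivor.
    for task in sorted(evicted, key=lambda t: loads[t], reverse=True):
        load, proc = heapq.heappop(heap)
        new_assignment[task] = proc
        heapq.heappush(heap, (load + loads[task], proc))
    return new_assignment


if __name__ == "__main__":
    before = {"a": 0, "b": 1, "c": 2, "d": 3}
    unit_loads = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
    print(shrink(before, unit_loads, 2))
```

Tasks already on surviving processors stay where they are, so only the evicted tasks migrate. An expand would be the mirror image: new processors start empty and tasks are moved onto them from the most heavily loaded ones.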

Keywords :

Linux; checkpointing; parallel processing; processor scheduling; resource allocation; Charm++; Linux shared memory; Stampede supercomputer; adaptive job scheduler; adaptive parallel runtime system; adaptive resource manager; adaptive scheduling decision execution; asynchronous split-phase mechanism; average response time reduction; bidirectional communication channel; checkpoint-restart; dynamic load balancing; dynamic processor expansion; dynamic processor shrinkage; external command; malleable jobs; parallel runtime system; resource manager; system utilization improvement; task migration; Adaptive systems; Linux; Load management; Program processors; Protocols; Runtime; Synchronization;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

High Performance Computing (HiPC), 2014 21st International Conference on

Print_ISBN :

978-1-4799-5975-4

Type :

conf

DOI :

10.1109/HiPC.2014.7116905

Filename :

7116905

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3591187