DocumentCode
3673296
Title
ParaMASK: A Multi-Agent System for the efficient and dynamic adaptation of HPC workloads
Author
Mateusz Guzek;Xavier Besseron;Sébastien Varrette;Grégoire Danoy;Pascal Bouvry
Author_Institution
Interdisciplinary Centre for Security, Reliability and Trust, 6, rue Richard Coudenhove-Kalergi, L-1359 Luxembourg, Luxembourg
fYear
2014
Firstpage
275
Lastpage
281
Abstract
The growing parallelism and heterogeneity of modern computing infrastructures such as High Performance Computing (HPC) platforms raise new challenges for their programmers and users. Additional requirements have emerged, such as minimizing the energy consumed, reducing the system resources used, or providing built-in reliability mechanisms. HPC applications therefore require adaptation mechanisms and must abandon traditional monolithic, centralized approaches in favor of novel autonomous, flexible, and decentralized decision systems. In this context, we describe a dynamic and flexible adaptation scheme based on a Multi-Agent System (MAS) to handle parallel or distributed executions in an HPC environment. More precisely, we model and extend the existing HPC middleware Kaapi to offer the power of the ParaMoise multi-agent organizational framework. Our proposed solution, named ParaMASK, relies on the similarities between ParaMoise workflow-based functional specifications and the Directed Acyclic Graph (DAG) representation of the distributed execution within Kaapi. As a result, ParaMASK makes it possible to analyze and reorganize the scheduling of the tasks that compose a program in an autonomous and decentralized way, while also handling dynamic adaptations (for example, using task migration to meet energy consumption goals). The proposed solution was implemented on top of the existing Kaapi middleware and includes an optimized algorithm for agent coordination. ParaMASK has been validated with a series of experiments on a real computational grid. Experimental results show good scalability and an exceptionally low overhead induced by the approach: less than a 1.5% increase in execution time with periodic coordination every 15 seconds on 2662 cores.
Keywords
"Organizations","Monitoring","Computational modeling","Scheduling","System-on-chip"
Publisher
IEEE
Conference_Titel
Signal Processing and Information Technology (ISSPIT), 2014 IEEE International Symposium on
ISSN
2162-7843
Type
conf
DOI
10.1109/ISSPIT.2014.7300600
Filename
7300600
Link To Document