DocumentCode :
3077340
Title :
Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters
Author :
Chandrasekar, Raghunath Raja ; Venkatesh, Akshay ; Hamidouche, Khaled ; Panda, Dhabaleswar K.
Author_Institution :
Ohio State Univ., Columbus, OH, USA
fYear :
2015
fDate :
4-7 May 2015
Firstpage :
261
Lastpage :
270
Abstract :
Checkpoint-restart is a predominantly used reactive fault-tolerance mechanism for applications running on HPC systems. While there are innumerable studies in the literature that have analyzed, and optimized for, the performance and scalability of a variety of checkpointing protocols, not much research has been done from an energy or power perspective. The limited number of studies conducted along this line have primarily analyzed and modeled power and energy usage during checkpointing phases. Applications running on future exascale machines will be constrained by a power envelope, and it is important not only to understand the behavior of checkpointing systems under such an envelope but also to adopt techniques that can leverage power-capping capabilities exposed by the OS to achieve energy savings without forsaking performance. In this paper, we address the problem of marginal energy benefits with significant performance degradation due to naive application of power capping around checkpointing phases by proposing a novel power-aware checkpointing framework -- Power-Check. By use of data funneling mechanisms and selective core power-capping, Power-Check makes efficient use of the I/O and CPU subsystem. Evaluations with application kernels show that Power-Check can yield as much as 48% reduction in the amount of energy consumed during a checkpoint, while improving the checkpointing performance by 14%.
Keywords :
checkpointing; parallel processing; power aware computing; CPU subsystem; HPC clusters; I/O subsystem; Power-Check; data funneling mechanisms; energy-efficient checkpointing framework; power-aware checkpointing framework; selective core power-capping; Checkpointing; Kernel; Libraries; Middleware; Protocols; Registers; Runtime; BLCR; DMTCP; RAPL; checkpointing; energy-efficiency; power-capping;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
Conference_Location :
Shenzhen
Type :
conf
DOI :
10.1109/CCGrid.2015.169
Filename :
7152492