DocumentCode
3077340
Title
Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters
Author
Chandrasekar, Raghunath Raja ; Venkatesh, Akshay ; Hamidouche, Khaled ; Panda, Dhabaleswar K.
Author_Institution
Ohio State Univ., Columbus, OH, USA
fYear
2015
fDate
4-7 May 2015
Firstpage
261
Lastpage
270
Abstract
Checkpoint-restart is a predominantly used reactive fault-tolerance mechanism for applications running on HPC systems. While there are innumerable studies in the literature that have analyzed, and optimized for, the performance and scalability of a variety of checkpointing protocols, not much research has been done from an energy or power perspective. The limited number of studies conducted along this line have primarily analyzed and modeled power and energy usage during checkpointing phases. Applications running on future exascale machines will be constrained by a power envelope, and it is important not only to understand the behavior of checkpointing systems under such an envelope but also to adopt techniques that can leverage power-capping capabilities exposed by the OS to achieve energy savings without forsaking performance. In this paper, we address the problem of marginal energy benefits with significant performance degradation due to naive application of power capping around checkpointing phases by proposing a novel power-aware checkpointing framework, Power-Check. By use of data funneling mechanisms and selective core power-capping, Power-Check makes efficient use of the I/O and CPU subsystems. Evaluations with application kernels show that Power-Check can yield as much as a 48% reduction in the amount of energy consumed during a checkpoint, while improving the checkpointing performance by 14%.
Keywords
checkpointing; parallel processing; power aware computing; CPU subsystem; HPC clusters; I/O subsystem; Power-Check; data funneling mechanisms; energy-efficient checkpointing framework; power-aware checkpointing framework; selective core power-capping; Checkpointing; Kernel; Libraries; Middleware; Protocols; Registers; Runtime; BLCR; DMTCP; RAPL; checkpointing; energy-efficiency; power-capping;
fLanguage
English
Publisher
ieee
Conference_Titel
Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
Conference_Location
Shenzhen
Type
conf
DOI
10.1109/CCGrid.2015.169
Filename
7152492
Link To Document