• DocumentCode
    3077340
  • Title

    Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters

  • Author

    Chandrasekar, Raghunath Raja ; Venkatesh, Akshay ; Hamidouche, Khaled ; Panda, Dhabaleswar K.

  • Author_Institution
    Ohio State Univ., Columbus, OH, USA
  • fYear
    2015
  • fDate
    4-7 May 2015
  • Firstpage
    261
  • Lastpage
    270
  • Abstract
    Checkpoint-restart is a predominantly used reactive fault-tolerance mechanism for applications running on HPC systems. While there are innumerable studies in the literature that have analyzed, and optimized for, the performance and scalability of a variety of checkpointing protocols, not much research has been done from an energy or power perspective. The limited number of studies conducted along this line have primarily analyzed and modeled power and energy usage during checkpointing phases. Applications running on future exascale machines will be constrained by a power envelope, and it is important not only to understand the behavior of checkpointing systems under such an envelope but also to adopt techniques that can leverage power capping capabilities exposed by the OS to achieve energy savings without forsaking performance. In this paper, we address the problem of marginal energy benefits with significant performance degradation due to naive application of power capping around checkpointing phases by proposing a novel power-aware checkpointing framework, Power-Check. By use of data funneling mechanisms and selective core power-capping, Power-Check makes efficient use of the I/O and CPU subsystems. Evaluations with application kernels show that Power-Check can yield as much as 48% reduction in the amount of energy consumed during a checkpoint, while improving the checkpointing performance by 14%.
  • Keywords
    checkpointing; parallel processing; power aware computing; CPU subsystem; HPC clusters; I/O subsystem; Power-Check; data funneling mechanisms; energy-efficient checkpointing framework; power-aware checkpointing framework; selective core power-capping; Checkpointing; Kernel; Libraries; Middleware; Protocols; Registers; Runtime; BLCR; DMTCP; RAPL; checkpointing; energy-efficiency; power-capping;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
  • Conference_Location
    Shenzhen
  • Type

    conf

  • DOI
    10.1109/CCGrid.2015.169
  • Filename
    7152492