• DocumentCode
    611082
  • Title

    ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance Protocols for HPC Applications

  • Author

    el Mehdi Diouri, M. ; Gluck, O. ; Lefevre, Laurent ; Cappello, Franck

  • Author_Institution
    Lab. de l'Inf. du Parallelisme, Univ. Lyon 1, Villeurbanne, France
  • fYear
    2013
  • fDate
    13-16 May 2013
  • Firstpage
    522
  • Lastpage
    529
  • Abstract
    Energy consumption and fault tolerance are two interrelated issues to address when designing future exascale systems. Fault tolerance protocols used for checkpointing consume different amounts of energy depending on parameters such as application features, the number of processes in the execution, and platform characteristics. Currently, the only way to select a protocol for a given execution is to run the application and monitor the energy consumption of the different fault tolerance protocols. This must be repeated for any variation of the execution setting. To avoid this time- and energy-consuming process, we propose an energy estimation framework. It relies on an energy calibration of the considered platform and a user description of the execution setting. We evaluate the accuracy of our estimations with real applications running on a real platform with energy consumption monitoring. Results show that our estimations are highly accurate and allow selecting the best fault tolerance protocol without pre-executing the application.
  • Keywords
    checkpointing; energy consumption; fault tolerant computing; parallel processing; ECOFIT; HPC applications; check pointing; energy calibration; energy consumption estimation; energy consumption monitoring; exascale systems; fault tolerance protocols; Calibration; Checkpointing; Energy consumption; Fault tolerance; Fault tolerant systems; Power demand; Protocols; Checkpoint/Restart; Energy Consumption; Estimation; Fault tolerance protocols; Performance;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on
  • Conference_Location
    Delft
  • Print_ISBN
    978-1-4673-6465-2
  • Type

    conf

  • DOI
    10.1109/CCGrid.2013.80
  • Filename
    6546134