DocumentCode
3200346
Title
Investigating the Interplay between Energy Efficiency and Resilience in High Performance Computing
Author
Li Tan ; Song, Shuaiwen Leon ; Panruo Wu ; Zizhong Chen ; Rong Ge ; Kerbyson, Darren J.
Author_Institution
Univ. of California, Riverside, Riverside, CA, USA
fYear
2015
fDate
25-29 May 2015
Firstpage
786
Lastpage
796
Abstract
Energy efficiency and resilience are two crucial challenges for HPC systems to reach exascale. While energy efficiency and resilience issues have been extensively studied individually, little has been done to understand the interplay between energy efficiency and resilience for HPC systems. Decreasing the supply voltage associated with a given operating frequency for processors and other CMOS-based components can significantly reduce power consumption. However, this often raises system failure rates and consequently increases application execution time. In this work, we present an energy saving undervolting approach that leverages the mainstream resilience techniques to tolerate the increased failures caused by undervolting. Our strategy is directed by analytic models, which capture the impact of undervolting and the interplay between energy efficiency and resilience. Experimental results on a power-aware cluster demonstrate that our approach can save up to 12.1% energy compared to the baseline, and conserve up to 9.1% more energy than a state-of-the-art DVFS solution.
Keywords
energy conservation; parallel processing; power aware computing; power consumption; CMOS-based components; HPC systems; analytic models; application execution time; energy efficiency; energy saving undervolting approach; failures; high performance computing; operating frequency; power consumption; power-aware cluster; processors; resilience; supply voltage; system failure rates; Circuit faults; Error correction codes; Hardware; Mathematical model; Optimized production technology; Program processors; Resilience; HPC; energy; failures; resilience; undervolting;
fLanguage
English
Publisher
ieee
Conference_Titel
Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International
Conference_Location
Hyderabad
ISSN
1530-2075
Type
conf
DOI
10.1109/IPDPS.2015.108
Filename
7161565