DocumentCode :
625628
Title :
A Simplified and Accurate Model of Power-Performance Efficiency on Emergent GPU Architectures
Author :
Shuaiwen Song ; Chunyi Su ; Barry Rountree ; Kirk W. Cameron
Author_Institution :
Virginia Tech, Blacksburg, VA, USA
fYear :
2013
fDate :
20-24 May 2013
Firstpage :
673
Lastpage :
686
Abstract :
Emergent heterogeneous systems must be optimized for both power and performance at exascale. Massive parallelism combined with complex memory hierarchies forms a barrier to efficient application and architecture design. These challenges are exacerbated on GPUs, where parallelism increases by orders of magnitude and power consumption can easily double. Models have been proposed to isolate power and performance bottlenecks and identify their root causes. However, no current model combines simplicity, accuracy, and support for emergent GPU architectures (e.g., NVIDIA Fermi). We combine hardware performance counter data with machine learning and advanced analytics to model power-performance efficiency for modern GPU-based systems. Our performance-counter-based approach is simpler than previous approaches and does not require detailed understanding of the underlying architecture. The resulting model is accurate for predicting power (within 2.1%) and performance (within 6.7%) for application kernels on modern GPUs. Our model can identify power-performance bottlenecks and their root causes for various complex computation and memory access patterns (e.g., global, shared, texture). We measure the accuracy of our power and performance models on an NVIDIA Fermi C2075 GPU for more than a dozen CUDA applications. We show our power model is more accurate and robust than the best available GPU power models, the multiple linear regression models MLR and MLR+. We demonstrate how to use our models to identify power-performance bottlenecks and suggest optimization strategies for high-performance codes such as GEM, a biomolecular electrostatic analysis application. We verify that our power-performance model is accurate on clusters of NVIDIA Fermi M2090s and useful for suggesting optimal runtime configurations on the Keeneland supercomputer at Georgia Tech.
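Note: the abstract contrasts the proposed counter-based model with the MLR/MLR+ baselines. As a point of reference only, the following is a minimal sketch of an MLR-style GPU power model fit from performance counters; the counter names and numbers are hypothetical placeholders, not the authors' actual feature set, measurements, or method.

```python
# Minimal sketch of a multiple-linear-regression (MLR-style) GPU power model.
# Counter names and data are illustrative placeholders, not from the paper.
import numpy as np

# One row per profiled kernel run; columns are normalized counter rates
# (e.g., instructions issued, global loads, shared loads, texture fetches).
counters = np.array([
    [0.82, 0.31, 0.12, 0.05],
    [0.45, 0.58, 0.40, 0.01],
    [0.91, 0.12, 0.05, 0.33],
    [0.30, 0.70, 0.22, 0.10],
    [0.66, 0.44, 0.18, 0.07],
])
measured_power_w = np.array([182.0, 151.0, 190.0, 138.0, 165.0])  # watts

# Fit P ~ b0 + b1*x1 + ... + bk*xk by ordinary least squares.
X = np.column_stack([np.ones(len(counters)), counters])
coeffs, *_ = np.linalg.lstsq(X, measured_power_w, rcond=None)

# Predict power for a new kernel's counter profile (leading 1.0 is the intercept term).
new_profile = np.array([1.0, 0.75, 0.25, 0.15, 0.08])
print(f"predicted power: {new_profile @ coeffs:.1f} W")
```

The paper's contribution is a counter-based model reported to outperform this kind of baseline; the sketch only illustrates the regression setup being compared against.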
Keywords :
graphics processing units; optimisation; parallel architectures; regression analysis; CUDA application; NVIDIA Fermi M2090s; complex memory hierarchy; emergent GPU architecture; emergent heterogeneous system; high-performance code; machine learning; memory access pattern; multiple linear regression model; optimization strategy; power consumption; power-performance efficiency; Adaptation models; Graphics processing units; Kernel; Predictive models; Radiation detectors; Runtime; Training;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
2013 IEEE 27th International Symposium on Parallel & Distributed Processing (IPDPS)
Conference_Location :
Boston, MA
ISSN :
1530-2075
Print_ISBN :
978-1-4673-6066-1
Type :
conf
DOI :
10.1109/IPDPS.2013.73
Filename :
6569853