DocumentCode
587616
Title
Model-based, memory-centric performance and power optimization on NUMA multiprocessors
Author
Su, Chunyi ; Li, Dong ; Nikolopoulos, Dimitrios S. ; Cameron, Kirk W. ; de Supinski, Bronis R. ; Leon, Edgar A.
Author_Institution
Dept. of Comput. Sci., Virginia Tech, Blacksburg, VA, USA
fYear
2012
fDate
4-6 Nov. 2012
Firstpage
164
Lastpage
173
Abstract
Non-Uniform Memory Access (NUMA) architectures are ubiquitous in HPC systems. NUMA, along with other factors including socket layout, data placement, and memory contention, significantly increases the search space for finding an optimal mapping of applications to NUMA systems. This search space may be intractable for online optimization and challenging for efficient offline search. This paper presents DyNUMA, a framework for dynamic optimization of programs on NUMA architectures. DyNUMA uses simple, memory-centric performance and energy models with non-linear terms to capture the complex, interacting effects of system layout, program concurrency, data placement, and memory controller contention. DyNUMA leverages an artificial neural network (ANN) with input, output, and intermediate layers that emulate program threads, memory controllers, processor cores, and their interactions. Using the ANN in conjunction with critical path analysis, DyNUMA autonomously optimizes programs for performance or energy-efficiency metrics. We used DyNUMA on a variety of benchmarks from the NPB and ASC Sequoia suites on three different architectures (a 16-core AMD Barcelona system, a 32-core AMD Magny-Cours system, and a 64-core Tilera TilePro64 system). Our results show that, compared to the default Linux scheduling, DyNUMA achieves on average an 8.7% improvement in performance (12.9% in the best case), a 16% improvement in Energy-Delay (30.6% in the best case), and a 9.1% improvement in MFLOPS/Watt (10.7% in the best case).
Keywords
circuit optimisation; concurrency control; energy conservation; memory architecture; microprocessor chips; multiprocessing systems; neural nets; parallel processing; parallel programming; performance evaluation; power aware computing; 16-core AMD Barcelona system; 32-core AMD Magny-Cours system; 64-core Tilera TilePro64 system; ANN; ASC Sequoia suites; DyNUMA; HPC systems; NPB Sequoia suites; NUMA multiprocessors; artificial neural network; critical path analysis; data placement; dynamic program optimization; efficient offline search; energy-delay; energy-efficiency metrics; linux scheduling; memory contention; memory controller contention; memory controllers; model-based memory-centric performance optimization; model-based memory-centric power optimization; nonlinear terms; nonuniform memory access architectures; online optimization; optimal mapping; processor cores; program concurrency; program threads; search space; socket layout; Artificial neural networks; Concurrent computing; Hardware; Instruction sets; Measurement; Optimization; Sockets;
fLanguage
English
Publisher
IEEE
Conference_Titel
Workload Characterization (IISWC), 2012 IEEE International Symposium on
Conference_Location
La Jolla, CA
Print_ISBN
978-1-4673-4531-6
Type
conf
DOI
10.1109/IISWC.2012.6402921
Filename
6402921