Model-based, memory-centric performance and power optimization on NUMA multiprocessors

Author

Chunyi Su; Dong Li; Dimitrios S. Nikolopoulos; Kirk W. Cameron; Bronis R. de Supinski; Edgar A. Leon

Author_Institution

Dept. of Comput. Sci., Virginia Tech, Blacksburg, VA, USA

fYear

2012

fDate

4-6 Nov. 2012

Firstpage

164

Lastpage

173

Abstract

Non-Uniform Memory Access (NUMA) architectures are ubiquitous in HPC systems. NUMA, along with other factors including socket layout, data placement, and memory contention, significantly increases the search space for finding an optimal mapping of applications to NUMA systems. This search space may be intractable for online optimization and challenging for efficient offline search. This paper presents DyNUMA, a framework for dynamic optimization of programs on NUMA architectures. DyNUMA uses simple, memory-centric performance and energy models with non-linear terms to capture the complex and interacting effects of system layout, program concurrency, data placement, and memory controller contention. DyNUMA leverages an artificial neural network (ANN) with input, output, and intermediate layers that emulate program threads, memory controllers, processor cores, and their interactions. Using an ANN in conjunction with critical path analysis, DyNUMA autonomously optimizes programs for performance or energy-efficiency metrics. We used DyNUMA on a variety of benchmarks from the NPB and ASC Sequoia suites on three different architectures (a 16-core AMD Barcelona system, a 32-core AMD Magny-Cours system, and a 64-core Tilera TilePro64 system). Our results show that DyNUMA achieves on average 8.7% improvement in performance (12.9% in the best case), 16% improvement in Energy-Delay (30.6% in the best case), and 9.1% improvement in MFLOPS/Watt (10.7% in the best case) compared to the default Linux scheduling.

Keywords

circuit optimisation; concurrency control; energy conservation; memory architecture; microprocessor chips; multiprocessing systems; neural nets; parallel processing; parallel programming; performance evaluation; power aware computing; 16-core AMD Barcelona system; 32-core AMD Magny-Cours system; 64-core Tilera TilePro64 system; ANN; ASC Sequoia suites; DyNUMA; HPC systems; NPB Sequoia suites; NUMA multiprocessors; artificial neural network; critical path analysis; data placement; dynamic program optimization; efficient offline search; energy-delay; energy-efficiency metrics; Linux scheduling; memory contention; memory controller contention; memory controllers; model-based memory-centric performance optimization; model-based memory-centric power optimization; nonlinear terms; nonuniform memory access architectures; online optimization; optimal mapping; processor cores; program concurrency; program threads; search space; socket layout; Artificial neural networks; Concurrent computing; Hardware; Instruction sets; Measurement; Optimization; Sockets

fLanguage

English

Publisher

IEEE

Conference_Title

Workload Characterization (IISWC), 2012 IEEE International Symposium on

Conference_Location

La Jolla, CA

Print_ISBN

978-1-4673-4531-6

Type

conf

DOI

10.1109/IISWC.2012.6402921

Filename

6402921

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=587616