Title :
Undersubscribed threading on clustered cache architectures
Author :
Heirman, W. ; Carlson, Trevor E. ; Van Craeynest, Kenzo ; Hur, Ibrahim ; Jaleel, Aamer ; Eeckhout, Lieven
Author_Institution :
Ghent Univ., Ghent, Belgium
Abstract :
Recent many-core processors such as Intel's Xeon Phi and GPGPUs specialize in running highly scalable parallel applications at high performance while simultaneously embracing energy efficiency as a first-order design constraint. The traditional belief is that full utilization of all available cores also translates into the highest possible performance. In this paper, we study the effects of cache capacity conflicts and competition for shared off-chip bandwidth, and show that undersubscription, or not utilizing all cores, often yields significant increases in both performance and energy efficiency. Based on a detailed shared working set analysis, we make the case for clustered cache architectures as an efficient design point for exploiting both data sharing and undersubscription, while providing low latency and ease of implementation in many-core processors. We then propose ClusteR-aware Undersubscribed Scheduling of Threads (CRUST), which dynamically matches an application's working set size and off-chip bandwidth demands with the available on-chip cache capacity and off-chip bandwidth. CRUST improves application performance and energy efficiency by 15% on average, and up to 50%, for the NPB and SPEC OMP benchmarks. In addition, we make recommendations for the design of future many-core architectures, and show that taking the undersubscription usage model into account moves the optimum performance under the cores-versus-cache area tradeoff towards design points with more cores and less cache.
Keywords :
cache storage; multiprocessing systems; processor scheduling; CRUST; ClusteR-aware undersubscribed scheduling of threads; GPGPU; Intel Xeon Phi; NPB benchmarks; SPEC OMP benchmarks; cache capacity conflict effect; clustered cache architectures; data sharing; design points; energy efficiency; first-order design constraint; many-core processors; on-chip cache capacity; shared off-chip bandwidth; shared working set analysis; undersubscribed threading; undersubscription usage model; Bandwidth; Benchmark testing; Computer architecture; Dynamic scheduling; Instruction sets; Subscriptions
Conference_Title :
High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on
Conference_Location :
Orlando, FL
DOI :
10.1109/HPCA.2014.6835975