Breaking the bandwidth wall in chip multiprocessors

Author

Vega, Augusto ; Cabarcas, Felipe ; Ramírez, Alex ; Valero, Mateo

Author_Institution

Barcelona Supercomputing Center, Universitat Politècnica de Catalunya, Barcelona, Spain

fYear

2011

fDate

18-21 July 2011

Firstpage

255

Lastpage

262

Abstract

In throughput-aware CMPs like GPUs and DSPs, software-managed streaming memory systems are an effective way to tolerate high latencies. For example, the Cell/B.E. incorporates local memories, and data transfers to and from those memories are overlapped with computation using DMAs. In such designs, the latency of the memory system has little impact on performance; instead, memory bandwidth becomes critical. As the number of cores increases, conventional DRAMs no longer suffice to satisfy the bandwidth demand. Hence, recent throughput-aware CMPs have adopted caches to filter off-chip traffic. However, such caches are optimized for latency, not bandwidth. This work presents a redesign of the memory system in throughput-aware CMPs. Instead of a traditional latency-aware cache, we propose to spread the address space across a shared, non-coherent last-level cache (LLC) using fine-grained interleaving. In this way, on-chip storage is optimally used, with no need to maintain coherence. On the memory side, we also propose interleaving across DRAMs, but with a much finer granularity than the usual page-sized approaches. Our proposal is highly optimized for bandwidth, not latency, by avoiding data replication in the LLC and by using fine-grained address space interleaving in both the LLC and the memory. For a CMP with 128 cores and a 64-MB LLC, performance improves by 21% due to the LLC optimizations and by an extra 42% due to the off-chip memory optimizations, for a total performance improvement of 1.7 times.
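The interleaving idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the block sizes, the bank count, the transfer size, and the function names `bank_of` and `banks_touched` are all hypothetical choices made for illustration. The sketch shows why a contiguous transfer is spread over many banks under fine-grained interleaving but lands on a single bank under page-sized interleaving. The last lines check the reported figures under the assumption that the two gains compound multiplicatively.

```python
# Illustrative sketch only. All parameters below are assumptions,
# not values taken from the paper.
FINE_BYTES = 128    # assumed fine-grained interleaving unit
PAGE_BYTES = 4096   # assumed page-sized interleaving unit
NUM_BANKS = 8       # assumed number of LLC banks (or DRAM channels)


def bank_of(addr, granularity, num_banks=NUM_BANKS):
    """Bank owning `addr` when consecutive blocks of `granularity`
    bytes are assigned to banks round-robin."""
    return (addr // granularity) % num_banks


def banks_touched(start, length, granularity, num_banks=NUM_BANKS):
    """Set of banks accessed by one contiguous transfer (e.g. a DMA)."""
    first_block = start // granularity
    last_block = (start + length - 1) // granularity
    return {block % num_banks for block in range(first_block, last_block + 1)}


transfer_bytes = 2048  # hypothetical size of one DMA request
fine = banks_touched(0, transfer_bytes, FINE_BYTES)
coarse = banks_touched(0, transfer_bytes, PAGE_BYTES)
print(len(fine), len(coarse))  # 8 banks versus 1 bank

# Assumption: the 21% and the extra 42% gains multiply.
combined_speedup = 1.21 * 1.42
print(round(combined_speedup, 2))  # 1.72, in line with the reported 1.7 times
```

With these assumed parameters, the fine-grained mapping lets all eight banks serve the transfer in parallel, whereas the page-sized mapping serializes it on one bank, which is the bandwidth argument the abstract makes.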

Keywords

cache storage; microprocessor chips; multiprocessing systems; DSP; GPU; LLC optimization; bandwidth demand; chip multiprocessors; data replication; digital signal processors; dynamic memory allocation; graphics processing unit; last-level cache; off-chip memory optimization; on-chip storage; software-managed streaming memory systems; Bandwidth; Coherence; Computer architecture; Organizations; Program processors; Proposals; Random access memory

fLanguage

English

Publisher

IEEE

Conference_Titel

Embedded Computer Systems (SAMOS), 2011 International Conference on

Conference_Location

Samos

Print_ISBN

978-1-4577-0802-2

Electronic_ISBN

978-1-4577-0801-5

Type

conf

DOI

10.1109/SAMOS.2011.6045469

Filename

6045469