• DocumentCode
    1820773
  • Title
    Breaking the bandwidth wall in chip multiprocessors
  • Author
    Vega, Augusto; Cabarcas, Felipe; Ramírez, Alex; Valero, Mateo
  • Author_Institution
    Barcelona Supercomputing Center, Universitat Politècnica de Catalunya, Barcelona, Spain
  • fYear
    2011
  • fDate
    18-21 July 2011
  • Firstpage
    255
  • Lastpage
    262
  • Abstract
    In throughput-aware CMPs like GPUs and DSPs, software-managed streaming memory systems are an effective way to tolerate high latencies. For example, the Cell/B.E. incorporates local memories, and data transfers to and from those memories are overlapped with computation using DMAs. In such designs, the latency of the memory system has little impact on performance; instead, memory bandwidth becomes critical. As the number of cores increases, conventional DRAMs can no longer satisfy the bandwidth demand. Hence, recent throughput-aware CMPs have adopted caches to filter off-chip traffic. However, such caches are optimized for latency, not bandwidth. This work presents a redesign of the memory system in throughput-aware CMPs. Instead of a traditional latency-aware cache, we propose to spread the address space across a shared non-coherent last-level cache (LLC) using fine-grained interleaving. In this way, on-chip storage is optimally used, with no need to maintain coherence. On the memory side, we also propose interleaving across DRAMs, but with a much finer granularity than the usual page-size approaches. Our proposal is highly optimized for bandwidth, not latency, by avoiding data replication in the LLC and by using fine-grained address space interleaving in both the LLC and the memory. For a CMP with 128 cores and a 64-MB LLC, performance improves by 21% due to the LLC optimizations and by an extra 42% due to the off-chip memory optimizations, for a total performance improvement of 1.7 times.
  • Keywords
    cache storage; microprocessor chips; multiprocessing systems; DSP; GPU; LLC optimization; bandwidth demand; chip multiprocessors; data replication; digital signal processors; dynamic memory allocation; graphics processing unit; last-level cache; off-chip memory optimization; on-chip storage; software-managed streaming memory systems; Bandwidth; Coherence; Computer architecture; Organizations; Program processors; Proposals; Random access memory;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Embedded Computer Systems (SAMOS), 2011 International Conference on
  • Conference_Location
    Samos
  • Print_ISBN
    978-1-4577-0802-2
  • Electronic_ISBN
    978-1-4577-0801-5
  • Type
    conf
  • DOI
    10.1109/SAMOS.2011.6045469
  • Filename
    6045469
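
The fine-grained interleaving described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bank count (16), the fine granularity (64 bytes), the page size (4096 bytes), the transfer size (2048 bytes), and the helper names `bank_of` and `banks_touched` are all assumptions chosen for illustration. The sketch only shows why a contiguous streaming transfer reaches more banks, and so can draw on more aggregate bandwidth, when the address space is interleaved at a granularity finer than a page.

```python
# Illustrative sketch only; every constant below is an assumption,
# not a parameter taken from the paper.
NUM_BANKS = 16   # hypothetical number of LLC banks or DRAM channels
FINE = 64        # hypothetical fine interleaving granularity, in bytes
PAGE = 4096      # hypothetical page-size interleaving granularity, in bytes
TRANSFER = 2048  # hypothetical size of one contiguous DMA transfer, in bytes


def bank_of(address, granularity, num_banks):
    """Bank holding `address` when consecutive chunks of `granularity`
    bytes are assigned to banks round-robin."""
    return (address // granularity) % num_banks


def banks_touched(start, length, granularity, num_banks):
    """Set of banks accessed by a contiguous transfer of `length` bytes
    beginning at byte address `start`."""
    first_chunk = start // granularity
    last_chunk = (start + length - 1) // granularity
    return {chunk % num_banks for chunk in range(first_chunk, last_chunk + 1)}


fine = banks_touched(0, TRANSFER, FINE, NUM_BANKS)
coarse = banks_touched(0, TRANSFER, PAGE, NUM_BANKS)

print("banks used with fine-grained interleaving:", len(fine))
print("banks used with page-size interleaving:", len(coarse))
```

Under these assumed parameters the 2048-byte transfer is spread over all 16 banks with 64-byte interleaving, whereas with page-size interleaving it falls within a single page and therefore a single bank.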