• DocumentCode
    3663935
  • Title
    The Load Slice Core microarchitecture
  • Author
    Trevor E. Carlson;Wim Heirman;Osman Allam;Stefanos Kaxiras;Lieven Eeckhout
  • Author_Institution
    Uppsala University, Sweden
  • fYear
    2015
  • fDate
    6/1/2015 12:00:00 AM
  • Firstpage
    272
  • Lastpage
    284
  • Abstract
    Driven by the motivation to expose instruction-level parallelism (ILP), microprocessor cores have evolved from simple, in-order pipelines into complex, superscalar out-of-order designs. By extracting ILP, these processors also enable parallel cache and memory operations as a useful side effect. Today, however, the growing off-chip memory wall and complex cache hierarchies of many-core processors make cache and memory accesses ever more costly. This increases the importance of extracting memory hierarchy parallelism (MHP), while reducing the net impact of more general, yet complex and power-hungry ILP-extraction techniques. In addition, for multi-core processors operating in power- and energy-constrained environments, energy efficiency has largely replaced single-thread performance as the primary concern. Based on this observation, we propose a core microarchitecture that is aimed squarely at generating parallel accesses to the memory hierarchy while maximizing energy efficiency. The Load Slice Core extends the efficient in-order, stall-on-use core with a second in-order pipeline that enables memory accesses and address-generating instructions to bypass stalled instructions in the main pipeline. Backward program slices containing address-generating instructions leading up to loads and stores are extracted automatically by the hardware, using a novel iterative algorithm that requires no software support or recompilation. On average, the Load Slice Core improves performance over a baseline in-order processor by 53% with overheads of only 15% in area and 22% in power, leading to an increase in energy efficiency (MIPS/Watt) over in-order and out-of-order designs by 43% and over 4.7×, respectively. In addition, for a power- and area-constrained many-core design, the Load Slice Core outperforms both in-order and out-of-order designs, achieving 53% and 95% higher performance, respectively, thus providing an alternative direction for future many-core processors.
  • Keywords
    "Registers","Out of order","Parallel processing","Random access memory","Radio frequency","Microarchitecture"
  • Publisher
    IEEE
  • Conference_Titel
    Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on
  • Type
    conf
  • DOI
    10.1145/2749469.2750407
  • Filename
    7284072