Title :
Breaking the on-chip latency barrier using SMART
Author :
Krishna, Tushar ; Chen, Chia-Hsin Owen ; Kwon, Woo Cheol ; Peh, Li-Shiuan
Author_Institution :
Comput. Sci. & Artificial Intell. Lab. (CSAIL), Massachusetts Inst. of Technol., Cambridge, MA, USA
Abstract :
As the number of on-chip cores increases, scalable on-chip topologies such as meshes inevitably add multiple hops in each network traversal. The best we can do right now is to design 1-cycle routers, such that the low-load network latency between a source and destination is equal to the number of routers + links (i.e., hops × 2) between them. OS/compiler and cache coherence protocol designers often try to limit communication to within a few hops, since on-chip latency is critical for their scalability. In this work, we propose an on-chip network called SMART (Single-cycle Multi-hop Asynchronous Repeated Traversal) that aims to present a single-cycle data-path all the way from the source to the destination. We do not add any additional fast physical express links in the data-path; instead we drive the shared crossbars and links asynchronously up to multiple hops within a single cycle. We design a router + link microarchitecture to achieve such a traversal, and a flow-control technique to arbitrate and set up multi-hop paths within a cycle. A place-and-routed design at 45nm achieves 11 hops within a 1GHz cycle for paths without turns (9 for paths with turns). We observe 5-8X reduction in low-load latencies across synthetic traffic patterns on an 8×8 CMP, compared to a baseline 1-cycle router. Full-system simulations with SPLASH-2 and PARSEC benchmarks demonstrate 27/52% and 20/59% reduction in runtime and EDP for Private/Shared L2 designs.
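The latency arithmetic in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the function names are hypothetical, and the SMART side counts only data-path traversal cycles on a straight path, assuming the 11-hops-per-cycle figure quoted above and ignoring path setup, arbitration, contention and turns. It therefore does not reproduce the reported 5-8X figure.

```python
# Illustrative only: hypothetical helper names, simplified assumptions.

def baseline_latency(hops: int) -> int:
    """Low-load latency of the baseline from the abstract:
    one cycle per router plus one cycle per link, i.e. hops * 2."""
    return 2 * hops


def smart_traversal_cycles(hops: int, max_hops_per_cycle: int = 11) -> int:
    """Cycles spent on the data-path alone if up to `max_hops_per_cycle`
    hops are covered per cycle (11 is the abstract's figure for paths
    without turns). Assumption: setup and contention are not modelled."""
    if max_hops_per_cycle <= 0:
        raise ValueError("max_hops_per_cycle must be positive")
    # Integer ceiling division.
    return -(-hops // max_hops_per_cycle)


if __name__ == "__main__":
    hops = 7  # straight traversal across one row of an 8x8 mesh
    print(baseline_latency(hops), smart_traversal_cycles(hops))
```

For paths with turns, the abstract's figure of 9 hops per cycle could be passed as `max_hops_per_cycle` instead.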
Keywords :
cache storage; computer architecture; microprocessor chips; network routing; program compilers; 1-cycle routers; CMP; EDP; OS-compiler; PARSEC benchmarks; SMART; SPLASH-2 benchmarks; cache coherence protocols designers; low-load network latency; on-chip cores; on-chip latency barrier; place-and-routed design; private-shared L2 designs; router+link microarchitecture; scalable on-chip topologies; shared crossbars; single-cycle multihop asynchronous repeated traversal; Delays; Pipelines; Repeaters; Runtime; Switches; System-on-chip; Wires;
Conference_Title :
High Performance Computer Architecture (HPCA2013), 2013 IEEE 19th International Symposium on
Conference_Location :
Shenzhen
Print_ISBN :
978-1-4673-5585-8
DOI :
10.1109/HPCA.2013.6522334