Title :
Wire delay is not a problem for SMT (in the near future)
Author :
Vijaykumar, T.N. ; Chishti, Zeshan
Author_Institution :
Sch. of Electr. & Comput. Eng., Purdue Univ., West Lafayette, IN, USA
Abstract :
Previous papers have shown that the slow scaling of wire delays compared to logic delays will prevent superscalar performance from scaling with technology. In this paper, we show that the optimal pipeline for superscalar becomes shallower with technology, when wire delays are considered, tightening previous results that deeper pipelines perform only as well as shallower pipelines. The key reason for the lack of performance scaling is that superscalar does not have sufficient parallelism to hide the relatively-increased wire delays. However, Simultaneous Multithreading (SMT) provides the much-needed parallelism. We show that an SMT running a multiprogrammed workload with just 4-way issue not only retains the optimal pipeline depth over technology generations, enabling at least 43% increase in clock speed every generation, but also achieves the remainder of the expected speedup of two per generation through IPC. As wire delays become more dominant in future technologies, the number of programs needs to be scaled modestly to maintain the scaling trends, at least till the near-future 50nm technology. While this result ignores bandwidth constraints, using SMT to tolerate latency due to wire delays is not that simple because SMT causes bandwidth problems. Most of the stages of a modern out-of-order-issue pipeline employ RAM and CAM structures. Wire delays in conventional, latency-optimized RAM/CAM structures prevent them from being pipelined in a scaled manner. We show that this limitation prevents scaling of SMT throughput. We use bitline scaling to allow RAM/CAM bandwidth to scale with technology. Bitline scaling enables SMT throughput to scale at the rate of two per technology generation in the near future.
Keywords :
multi-threading; multiprocessing systems; parallel processing; pipeline processing; random-access storage; CAM structure; RAM structure; RAM/CAM bandwidth; bandwidth constraints; bitline scaling; clock speed; latency tolerance; latency-optimized RAM/CAM structures; logic delays; multiprogrammed workload; optimal pipeline depth; parallel processing; performance scaling; simultaneous multithreading; superscalar performance; wire delay scaling; Bandwidth; CADCAM; Computer aided manufacturing; Delay; Logic; Paper technology; Pipelines; Surface-mount technology; Throughput; Wire;
Conference_Titel :
Computer Architecture, 2004. Proceedings. 31st Annual International Symposium on
Print_ISBN :
0-7695-2143-6
DOI :
10.1109/ISCA.2004.1310762