DocumentCode
3129688
Title
Wire delay is not a problem for SMT (in the near future)
Author
Vijaykumar, T.N. ; Chishti, Zeshan
Author_Institution
Sch. of Electr. & Comput. Eng., Purdue Univ., West Lafayette, IN, USA
fYear
2004
fDate
19-23 June 2004
Firstpage
40
Lastpage
51
Abstract
Previous papers have shown that the slow scaling of wire delays compared to logic delays will prevent superscalar performance from scaling with technology. In this paper, we show that, when wire delays are considered, the optimal pipeline for superscalar becomes shallower with each technology generation, tightening previous results that deeper pipelines perform only as well as shallower pipelines. The key reason for the lack of performance scaling is that superscalar does not have sufficient parallelism to hide the relatively increased wire delays. However, Simultaneous Multithreading (SMT) provides the much-needed parallelism. We show that an SMT running a multiprogrammed workload with just 4-way issue not only retains the optimal pipeline depth over technology generations, enabling at least a 43% increase in clock speed every generation, but also achieves the remainder of the expected speedup of two per generation through IPC. As wire delays become more dominant in future technologies, the number of programs needs to be scaled modestly to maintain the scaling trends, at least until the near-future 50 nm technology. This result ignores bandwidth constraints, however, and using SMT to tolerate latency due to wire delays is not that simple because SMT causes bandwidth problems. Most of the stages of a modern out-of-order-issue pipeline employ RAM and CAM structures. Wire delays in conventional, latency-optimized RAM/CAM structures prevent them from being pipelined in a scaled manner. We show that this limitation prevents scaling of SMT throughput. We use bitline scaling to allow RAM/CAM bandwidth to scale with technology. Bitline scaling enables SMT throughput to scale at the rate of two per technology generation in the near future.
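The split between clock speed and IPC stated in the abstract can be checked with a line of arithmetic. The sketch below is illustrative only: it assumes throughput is the product of clock rate and IPC, and the function name `required_ipc_gain` is hypothetical rather than anything taken from the paper.

```python
# Illustrative arithmetic only. Assumption: throughput = clock rate x IPC,
# so the per-generation speedup factors multiply.

def required_ipc_gain(target_speedup: float, clock_gain: float) -> float:
    """Return the IPC factor x such that clock_gain * x == target_speedup."""
    if clock_gain <= 0:
        raise ValueError("clock_gain must be positive")
    return target_speedup / clock_gain


# Figures from the abstract: 2x speedup per generation, 43% faster clock.
ipc_gain = required_ipc_gain(target_speedup=2.0, clock_gain=1.43)
print(f"IPC gain needed per generation: {ipc_gain:.2f}x")
```

Under that assumption, a 43% clock increase leaves roughly a 1.4x IPC improvement per generation to reach the overall factor of two.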
Keywords
multi-threading; multiprocessing systems; parallel processing; pipeline processing; random-access storage; CAM structure; RAM structure; RAM/CAM bandwidth; bandwidth constraints; bitline scaling; clock speed; latency tolerance; latency-optimized RAM/CAM structures; logic delays; multiprogrammed workload; optimal pipeline depth; performance scaling; simultaneous multithreading; superscalar performance; wire delay scaling; Bandwidth; CADCAM; Computer aided manufacturing; Delay; Logic; Paper technology; Pipelines; Surface-mount technology; Throughput; Wire
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Architecture, 2004. Proceedings. 31st Annual International Symposium on
ISSN
1063-6897
Print_ISBN
0-7695-2143-6
Type
conf
DOI
10.1109/ISCA.2004.1310762
Filename
1310762