Wire delay is not a problem for SMT (in the near future)

Author

Vijaykumar, T.N. ; Chishti, Zeshan

Author_Institution

Sch. of Electr. & Comput. Eng., Purdue Univ., West Lafayette, IN, USA

fYear

2004

fDate

19-23 June 2004

Firstpage

40

Lastpage

51

Abstract

Previous papers have shown that the slow scaling of wire delays compared to logic delays will prevent superscalar performance from scaling with technology. In this paper, we show that the optimal pipeline for superscalar becomes shallower with technology when wire delays are considered, tightening previous results that deeper pipelines perform only as well as shallower pipelines. The key reason for the lack of performance scaling is that superscalar does not have sufficient parallelism to hide the relatively increased wire delays. However, Simultaneous Multithreading (SMT) provides the much-needed parallelism. We show that an SMT running a multiprogrammed workload with just 4-way issue not only retains the optimal pipeline depth over technology generations, enabling at least a 43% increase in clock speed every generation, but also achieves the remainder of the expected speedup of two per generation through IPC. As wire delays become more dominant in future technologies, the number of programs needs to be scaled modestly to maintain the scaling trends, at least until the near-future 50 nm technology. This result ignores bandwidth constraints, however, and using SMT to tolerate latency due to wire delays is not that simple because SMT causes bandwidth problems. Most of the stages of a modern out-of-order-issue pipeline employ RAM and CAM structures. Wire delays in conventional, latency-optimized RAM/CAM structures prevent them from being pipelined in a scaled manner. We show that this limitation prevents scaling of SMT throughput. We use bitline scaling to allow RAM/CAM bandwidth to scale with technology. Bitline scaling enables SMT throughput to scale at the rate of two per technology generation in the near future.
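
The speedup figures in the abstract can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes that per-generation speedup is the product of the clock-speed gain and the IPC gain (performance = IPC × frequency), and the function names are hypothetical.

```python
def required_ipc_gain(target_speedup: float = 2.0, clock_gain: float = 1.43) -> float:
    """IPC gain needed per generation so that clock_gain * ipc_gain == target_speedup.

    Assumption: speedup factors multiply (performance = IPC x clock frequency).
    The defaults come from the abstract: a speedup of two per generation and
    at least a 43% increase in clock speed.
    """
    return target_speedup / clock_gain


def cumulative_speedup(generations: int, per_generation: float = 2.0) -> float:
    """Compound speedup after a number of technology generations."""
    return per_generation ** generations


if __name__ == "__main__":
    print(f"IPC gain needed per generation: {required_ipc_gain():.2f}x")
    print(f"Speedup after 3 generations: {cumulative_speedup(3):.0f}x")
```

Under these assumptions, a 43% clock gain leaves roughly a 1.4x IPC improvement per generation to reach the expected doubling. Because the abstract states the clock gain as "at least" 43%, the IPC share computed this way is an upper bound.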

Keywords

multi-threading; multiprocessing systems; parallel processing; pipeline processing; random-access storage; CAM structure; RAM structure; RAM/CAM bandwidth; bandwidth constraints; bitline scaling; clock speed; latency tolerance; latency-optimized RAM/CAM structures; logic delays; multiprogrammed workload; optimal pipeline depth; performance scaling; simultaneous multithreading; superscalar performance; wire delay scaling; Bandwidth; CADCAM; Computer aided manufacturing; Delay; Logic; Paper technology; Pipelines; Surface-mount technology; Throughput; Wire

fLanguage

English

Publisher

IEEE

Conference_Title

Computer Architecture, 2004. Proceedings. 31st Annual International Symposium on

ISSN

1063-6897

Print_ISBN

0-7695-2143-6

Type

conf

DOI

10.1109/ISCA.2004.1310762

Filename

1310762