DocumentCode :
1540056
Title :
Improving latency tolerance of multithreading through decoupling
Author :
Parcerisa, Joan-Manuel ; González, Antonio
Author_Institution :
Dept. d'Arquitectura de Computadors, Univ. Politecnica de Catalunya, Barcelona, Spain
Volume :
50
Issue :
10
fYear :
2001
fDate :
10/1/2001
Firstpage :
1084
Lastpage :
1094
Abstract :
The increasing hardware complexity of dynamically scheduled superscalar processors may compromise the scalability of this organization and its ability to make efficient use of future increases in transistor budget. SMT processors, built on a superscalar core, are therefore directly affected by this problem. The article presents and evaluates a novel processor microarchitecture that combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in order, this architecture is significantly less complex, in terms of critical path delays, than a centralized out-of-order design, and it is better suited to future growth in issue width and clock speed. We investigate how the two techniques complement each other. Since decoupling hides memory latency very efficiently, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized. The study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious reduction in hardware complexity, it places lower demands on the memory system. The study also reveals that multithreading by itself exhibits little memory latency tolerance. Results suggest that most of the latency hiding effectiveness of SMT architectures comes from dynamic scheduling. Decoupling, on the other hand, is very effective at hiding memory latency. An increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent. For the nondecoupled multithreaded processor, the loss of performance is about 23 percent.
Keywords :
instruction sets; multi-threading; parallel architectures; processor scheduling; SMT architectures; SMT processors; access/execute decoupling; cache miss penalty; clock speed; critical path delays; decoupling; dynamic scheduling; dynamically scheduled superscalar processors; functional units; hardware complexity reduction; instruction-level parallelism; issue-width; latency hiding effectiveness; latency tolerance; maximum throughput; memory latency hiding efficiency; memory latency tolerance; memory system performance; multithreaded architecture; parallelism; processor microarchitecture; simultaneous multithreading; superscalar core; Delay effects; Dynamic scheduling; Hardware; Microarchitecture; Multithreading; Out of order; Process design; Processor scheduling; Scalability; Surface-mount technology;
fLanguage :
English
Journal_Title :
Computers, IEEE Transactions on
Publisher :
IEEE
ISSN :
0018-9340
Type :
jour
DOI :
10.1109/12.956093
Filename :
956093