Regional Information Center for Science and Technology (RICeST) - Improving latency tolerance of multithreading through decoupling

DocumentCode :

1540056

Title :

Improving latency tolerance of multithreading through decoupling

Author :

Parcerisa, Joan-Manuel ; González, Antonio

Author_Institution :

Dept. d'Arquitectura de Computadors, Univ. Politècnica de Catalunya, Barcelona, Spain

Volume :

Issue :

Year :

2001

Date :

10/1/2001

Firstpage :

1084

Lastpage :

1094

Abstract :

The increasing hardware complexity of dynamically scheduled superscalar processors may compromise the scalability of this organization to make efficient use of future increases in transistor budget. SMT processors, designed over a superscalar core, are therefore directly affected by this problem. The article presents and evaluates a novel processor microarchitecture which combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in order, this architecture is significantly less complex, in terms of critical path delays, than a centralized out-of-order design, and it is more effective for future growth in issue-width and clock speed. We investigate how both techniques complement each other. Since decoupling features an excellent memory latency hiding efficiency, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized. The study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious hardware complexity reduction, it places lower demands on the memory system. The study also reveals that multithreading by itself exhibits little memory latency tolerance. Results suggest that most of the latency hiding effectiveness of SMT architectures comes from the dynamic scheduling. On the other hand, decoupling is very effective at hiding memory latency. An increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent. For the nondecoupled multithreaded processor, the loss of performance is about 23 percent.

Keywords :

instruction sets; multi-threading; parallel architectures; processor scheduling; SMT architectures; SMT processors; access/execute decoupling; cache miss penalty; clock speed; critical path delays; decoupling; dynamic scheduling; dynamically scheduled superscalar processors; functional units; hardware complexity reduction; instruction-level parallelism; issue-width; latency hiding effectiveness; latency tolerance; maximum throughput; memory latency hiding efficiency; memory latency tolerance; memory system performance; multithreaded architecture; parallelism; processor microarchitecture; simultaneous multithreading; superscalar core; Delay effects; Dynamic scheduling; Hardware; Microarchitecture; Multithreading; Out of order; Process design; Processor scheduling; Scalability; Surface-mount technology;

Language :

English

Journal_Title :

Computers, IEEE Transactions on

Publisher :

IEEE

ISSN :

0018-9340

Type :

jour

DOI :

10.1109/12.956093

Filename :

956093

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1540056