Decoupled vector architectures

Author

Espasa, Roger ; Valero, Mateo

Author_Institution

Dept. d'Arquitectura de Computadors, Univ. Politècnica de Catalunya, Barcelona, Spain

fYear

1996

Firstpage

281

Lastpage

290

Abstract

The purpose of this paper is to show that the performance of vector programs can be greatly improved by using decoupling techniques in a vector processor. Using a trace-driven approach, we simulate a selection of the Perfect Club programs and compare their execution time on a conventional vector architecture and on a decoupled vector architecture. Decoupling provides a performance advantage of more than a factor of two for realistic memory latencies, and even with an ideal memory system with no latency, there is still a speedup of as much as 50%. A bypassing technique between the load/store queues is introduced, and we show that it can give an extra speedup of up to 22% while also reducing total memory traffic by an average of 20%. An important part of this paper is devoted to studying the tradeoffs involved in choosing an adequate size for the different queues of the architecture, so that the hardware cost of the queues can be minimized while still retaining most of the performance advantages of decoupling.

Keywords

performance evaluation; vector processor systems; Perfect Club programs; bypassing technique; decoupled vector architectures; hardware cost; performance; performance advantages; realistic memory latencies; total memory traffic; trace driven approach; vector processor; Computational modeling; Computer aided instruction; Computer architecture; Costs; Delay; Hardware; Multithreading; Parallel processing; Vector processors; Yarn

fLanguage

English

Publisher

IEEE

Conference_Titel

High-Performance Computer Architecture, 1996. Proceedings., Second International Symposium on

Conference_Location

San Jose, CA

Print_ISBN

0-8186-7237-4

Type

conf

DOI

10.1109/HPCA.1996.501193

Filename

501193