Title :
Optimizing loop performance for clustered VLIW architectures
Author :
Qian, Yi ; Carr, Steve ; Sweany, Philip
Author_Institution :
Dept. of Comput. Sci., Michigan Technol. Univ., Houghton, MI, USA
Abstract :
Modern embedded systems often require high degrees of instruction-level parallelism (ILP) within strict constraints on power consumption and chip cost. Unfortunately, a high-performance embedded processor with high ILP generally puts large demands on register resources, making it difficult to maintain a single, multi-ported register bank. To address this problem, some architectures, e.g., the Texas Instruments TMS320C6x, partition the register bank into multiple banks that are each directly connected only to a subset of functional units. These functional unit/register bank groups are called clusters. Clustered architectures require that either copy operations or delay slots be inserted when an operation accesses data stored on a different cluster. In order to generate excellent code for such architectures, the compiler must not only spread the computation across clusters to achieve maximum parallelism, but must also limit the effects of intercluster data transfers. Loop unrolling and unroll-and-jam enhance the parallelism in loops to help limit the effects of intercluster data transfers. In this paper we describe an accurate metric for predicting the intercluster communication cost of a loop and present an integer-optimization problem that can be used to guide the application of unroll-and-jam and loop unrolling considering the effects of both ILP and intercluster data transfers. Our method achieves a harmonic mean speedup of 1.4-1.7 on software pipelined loops for both a simulated architecture and the TI TMS320C64x.
Keywords :
embedded systems; instruction sets; parallel architectures; chip cost; clustered VLIW architectures; delay slots; embedded systems; high-performance embedded processor; instruction-level parallelism; intercluster communication; loop performance optimisation; loop unrolling; multi-ported register bank; power consumption; simulated architecture; strict constraints; unroll-and-jam; Computer architecture; Costs; Delay effects; Embedded system; Energy consumption; Instruments; Modems; Parallel processing; Registers; VLIW;
Conference_Title :
Parallel Architectures and Compilation Techniques, 2002. Proceedings. 2002 International Conference on
Print_ISBN :
0-7695-1620-3
DOI :
10.1109/PACT.2002.1106026