• DocumentCode
    703825
  • Title
    Inter-tile reuse optimization applied to bandwidth constrained embedded accelerators
  • Author
    Peemen, Maurice ; Mesman, Bart ; Corporaal, Henk
  • Author_Institution
    Dept. of Electr. Eng., Eindhoven Univ. of Technol., Eindhoven, Netherlands
  • fYear
    2015
  • fDate
    9-13 March 2015
  • Firstpage
    169
  • Lastpage
    174
  • Abstract
    The adoption of High-Level Synthesis (HLS) tools has significantly reduced accelerator design time. A complex scaling problem that remains is the data transfer bottleneck. To scale up performance, accelerators require huge amounts of data and are often limited by interconnect resources. In addition, the energy spent by the accelerator is often dominated by the transfer of data, either in the form of memory references or data movement on the interconnect. In this paper we drastically reduce accelerator communication by exploring computation reordering and local buffer usage. Consequently, we present a new analytical methodology to optimize nested loops for inter-tile data reuse with loop transformations like interchange and tiling. We focus on embedded accelerators that can be used in a multi-accelerator System on Chip (SoC), so performance, area, and energy are key in this exploration. 1) On three common embedded applications in the image/video processing domain (demosaicing, block matching, object detection), we show that our methodology reduces data movement by up to 2.1x compared to the best case of intra-tile optimization. 2) We demonstrate that our small accelerators (1-3% FPGA resources) can boost a simple MicroBlaze soft-core to the performance level of a high-end Intel i7 processor.
  • Keywords
    buffer storage; circuit optimisation; embedded systems; graphics processing units; high level synthesis; multiprocessor interconnection networks; system-on-chip; MicroBlaze soft core; bandwidth constrained embedded accelerator; buffer usage; complex scaling problem; data transfer; embedded applications; high-end Intel-i7 processor; high-level synthesis; inter-tile reuse optimization; interconnect resource; loop transformation; multiaccelerator SoC; nested loop optimization; system on chip; Arrays; Bismuth; Cost function; Data transfer; Schedules; Steady-state;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015
  • Conference_Location
    Grenoble
  • Print_ISBN
    978-3-9815-3704-8
  • Type
    conf
  • Filename
    7092377