• DocumentCode
    2483166
  • Title
    High-order stencil computations on multicore clusters
  • Author
    Peng, Liu ; Seymour, Richard ; Nomura, Ken-ichi ; Kalia, Rajiv K. ; Nakano, Aiichiro ; Vashishta, Priya ; Loddoch, Alexander ; Netzband, Michael ; Volz, William R. ; Wong, Chap C.
  • Author_Institution
    Dept. of Comput. Sci., Univ. of Southern California, Los Angeles, CA, USA
  • fYear
    2009
  • fDate
    23-29 May 2009
  • Firstpage
    1
  • Lastpage
    11
  • Abstract
    Stencil computation (SC) is critically important to a broad range of scientific and engineering applications. However, it is a challenge to optimize complex, high-order SC on emerging clusters of multicore processors. We have developed a hierarchical SC parallelization framework that combines: (1) spatial decomposition based on message passing; (2) multithreading using a critical section-free, dual representation; and (3) single-instruction multiple-data (SIMD) parallelism based on various code transformations. Our SIMD transformations include translocated statement fusion, vector composition via shuffle, and vectorized data layout reordering (e.g. matrix transpose), which are combined with traditional optimization techniques such as loop unrolling. We have thereby implemented two SCs of different characteristics: a diagonally dominant lattice Boltzmann method (LBM) for fluid flow simulation and a highly off-diagonal (6th-order) finite-difference time-domain (FDTD) code for seismic wave propagation. Both were implemented on a Cell Broadband Engine (Cell BE) based system (a cluster of PlayStation3 consoles), a dual Intel quadcore platform, and IBM BlueGene/L and BlueGene/P. We have achieved high inter-node and intra-node (multithreading and SIMD) scalability for the diagonally dominant LBM: weak-scaling parallel efficiency of 0.978 on 131,072 BlueGene/P processors; strong-scaling multithreading efficiency of 0.882 on 6 cores of Cell BE; and strong-scaling SIMD efficiency of 0.780 using 4-element vector registers of Cell BE. Implementation of the high-order SC, in contrast, is less efficient because of long-stride memory access and the limited size of the vector register file, which points to the need for further optimizations.
  • Keywords
    message passing; multi-threading; cell broadband engine; code transformations; dual Intel quadcore platform; finite-difference time-domain code; hierarchical stencil computation parallelization; high-order stencil computations; lattice Boltzmann method; message passing; multicore clusters; multicore processors; multithreading; seismic wave propagation; single-instruction multiple-data parallelism; vector composition; vector register file; vectorized data layout reordering; Finite difference methods; Fluid flow; Lattice Boltzmann methods; Matrix decomposition; Message passing; Multicore processing; Multithreading; Parallel processing; Registers; Time domain analysis;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on
  • Conference_Location
    Rome
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4244-3751-1
  • Electronic_ISBN
    1530-2075
  • Type
    conf
  • DOI
    10.1109/IPDPS.2009.5161011
  • Filename
    5161011