• DocumentCode
    3307950
  • Title

    Optimizing SIMD Parallel Computation with Non-Consecutive Array Access in Inline SSE Assembly Language

  • Author

    Juan, Chen ; Canqun, Yang

  • Author_Institution
    Sch. of Comput., Nat. Univ. of Defense Technol., Changsha, China
  • fYear
    2012
  • fDate
    12-14 Jan. 2012
  • Firstpage
    254
  • Lastpage
    257
  • Abstract
    Many processors, such as the Intel Xeon 5100 series and the AMD Athlon 64, support the SIMD computation model through the Streaming SIMD Extensions (SSE), SSE2, and SSE3. Double-precision SSE/SSE2/SSE3 instructions operate on two packed double-precision floating-point elements at once in 128-bit XMM vector registers, which greatly improves floating-point performance. SIMD computation sometimes involves non-consecutive rather than consecutive data, which hinders SIMD optimization: two non-consecutive double-precision elements cannot be loaded into a 128-bit vector register with a single load and must instead be loaded separately. This paper addresses how to apply SIMD optimization to non-consecutive data. Loop unrolling exposes the access pattern of such data, and register rotation helps transform non-consecutive data into vector data. Using a representative kernel program, we illustrate a SIMD optimization that combines loop unrolling with register rotation. Vectorizing the non-consecutive data improves the performance of the "KERNEL" code by 42.4% and that of the PQMRCGSTAB application by 15.3%.
  • Keywords
    floating point arithmetic; microprocessor chips; parallel processing; program compilers; AMD Athlon 64; Intel Xeon processor 5100 series; SIMD computation model; SIMD optimization; SIMD parallel computation optimisation; SSE; XMM vector registers; floating-point data; inline SSE assembly language; nonconsecutive array access; streaming SIMD extensions; Arrays; Assembly; Kernel; Optimization; Program processors; Registers; Vectors; SIMD; SSE/SSE2/SSE3; inline assembly; loop unrolling; nonconsecutive data; register rotation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Intelligent Computation Technology and Automation (ICICTA), 2012 Fifth International Conference on
  • Conference_Location
    Zhangjiajie, Hunan
  • Print_ISBN
    978-1-4673-0470-2
  • Type

    conf

  • DOI
    10.1109/ICICTA.2012.70
  • Filename
    6150189
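
The abstract's point that two non-consecutive doubles cannot be brought into one XMM register with a single load can be illustrated with a small sketch. This is not the paper's "KERNEL" code, and it does not reproduce its register rotation scheme, neither of which appears in this record. The strided-sum kernel and the function names `strided_sum_scalar` and `strided_sum_sse2` are hypothetical, and the code uses SSE2 intrinsics rather than inline assembly, assuming an x86 target with SSE2 available.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target with SSE2 */
#include <stddef.h>

/* Hypothetical kernel: sum of x[0], x[stride], x[2*stride], ... (n elements). */
double strided_sum_scalar(const double *x, size_t n, size_t stride) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += x[i * stride];
    }
    return s;
}

/* Same sum, unrolled by two so that each iteration fills one XMM register.
 * The two elements are not adjacent in memory, so one packed load cannot
 * fetch them; each half of the register is filled by its own load. */
double strided_sum_sse2(const double *x, size_t n, size_t stride) {
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        /* First load: low half = x[i*stride], high half = 0. */
        __m128d v = _mm_load_sd(&x[i * stride]);
        /* Second load: high half = x[(i+1)*stride], low half kept. */
        v = _mm_loadh_pd(v, &x[(i + 1) * stride]);
        acc = _mm_add_pd(acc, v);  /* two additions in one instruction */
    }
    /* Combine the two partial sums held in the accumulator. */
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    double s = lanes[0] + lanes[1];
    /* Scalar tail for an odd element count. */
    for (; i < n; ++i) {
        s += x[i * stride];
    }
    return s;
}
```

Calling `strided_sum_sse2(x, n, stride)` requires `x` to hold at least `(n - 1) * stride + 1` elements. The arithmetic is vectorized, but every register still costs two loads, which is the overhead the abstract describes. Note that the vector version adds elements in a different order than the scalar one, so results can differ in the last bits for inexact inputs.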