• DocumentCode
    2028087
  • Title
    The Batched DOACROSS loop parallelization algorithm

  • Author
    Lucas, Divino Cesar S. ; Araujo, Guido

  • Author_Institution
    Inst. of Comput., Univ. of Campinas, Campinas, Brazil
  • fYear
    2015
  • fDate
    20-24 July 2015
  • Firstpage
    476
  • Lastpage
    483
  • Abstract
    Parallelizing loops containing loop-carried dependencies has been considered a very difficult task, mainly due to the overhead imposed by communicating dependencies between iterations. Despite the huge effort to devise effective parallelization techniques for such loops, the problem is still far from solved. For many loops, neither old (DOACROSS) nor new (DSWP) techniques have been able to offer a solution. This paper presents a qualitative and quantitative analysis of the synchronization costs of these two loop parallelization algorithms on two modern computer architectures (ARM A9 MPCore and Intel Ivy Bridge). Our results show that at least 30% of the execution time of the programs we parallelized is spent on synchronization and data communication. We also show that, besides the problem being hard, these architectures sit at opposite ends of the axis of commonly accepted requirements for efficient loop parallelization. As a consequence, both techniques struggle to speed up several programs effectively. Moreover, this paper presents a novel algorithm, called Batched DOACROSS (BDX), that capitalizes on the advantages of DSWP and DOACROSS while minimizing their deficiencies. BDX does not require new hardware mechanisms (as DSWP does) and uses thread-local buffers to reduce DOACROSS synchronization overheads.
  • Keywords
    iterative methods; microprocessor chips; multiprocessing systems; parallel architectures; synchronisation; ARM A9 MPCore; BDX; DOACROSS synchronization overheads; DSWP; Intel Ivy Bridge; batched DOACROSS loop parallelization algorithm; computer architectures; iterations; loop-carried dependencies; parallelizing loops; synchronization costs; synchronization-data communication; Algorithm design and analysis; Computer architecture; Instruction sets; Parallel processing; Synchronization; Fine-grain Parallelism; Loop Parallelization Algorithm; Multicore Processors;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing & Simulation (HPCS), 2015 International Conference on
  • Conference_Location
    Amsterdam
  • Print_ISBN
    978-1-4673-7812-3
  • Type
    conf
  • DOI
    10.1109/HPCSim.2015.7237079
  • Filename
    7237079