• DocumentCode
    639334
  • Title

    Dynamic memory access monitoring based on tagged memory

  • Author

    Dathathri, Roshan ; Reddy, Chandan ; Ramashekar, Thejas ; Bondhugula, Uday

  • Author_Institution
    Dept. of Comput. Sci. & Autom., Indian Inst. of Sci., Bangalore, India
  • fYear
    2013
  • fDate
    7-11 Sept. 2013
  • Firstpage
    409
  • Lastpage
    410
  • Abstract
    Programming for parallel architectures that do not have a shared address space is extremely difficult due to the need for explicit communication between memories of different compute devices. Heterogeneous systems with CPUs and multiple GPUs, and distributed-memory clusters, are examples of such systems. Past works that try to automate data movement for distributed-memory architectures can lead to excessive redundant communication. In this paper, we propose an automatic data movement scheme that minimizes the volume of communication between compute devices in heterogeneous and distributed-memory systems. We show that by partitioning data dependences in a particular non-trivial way, one can generate data movement code that results in the minimum volume for a vast majority of cases. The techniques are applicable to any sequence of affine loop nests and work on top of any choice of loop transformations, parallelization, and computation placement. The data movement code generated minimizes the volume of communication for a particular configuration of these. We use a combination of powerful static analyses relying on the polyhedral compiler framework, and the lightweight runtime routines they generate, to build a source-to-source transformation tool that automatically generates communication code. We demonstrate that the tool is scalable and leads to substantial gains in efficiency. On a heterogeneous system, the communication volume is reduced by a factor of 11× to 83× over the state of the art, translating into a mean execution time speedup of 1.53×. On a distributed-memory cluster, our scheme reduces the communication volume by a factor of 1.4× to 63.5× over the state of the art, resulting in a mean speedup of 1.55×. In addition, our scheme yields a mean speedup of 2.19× over hand-optimized UPC codes.
  • Keywords
    distributed memory systems; parallel architectures; parallel programming; program compilers; program control structures; program diagnostics; CPUs; GPUs; affine loop nests; automatic data movement scheme; communication code generation; computation placement; data dependence partitioning; data movement code generation; distributed-memory architectures; distributed-memory cluster; distributed-memory systems; hand-optimized UPC codes; heterogeneous architectures; heterogeneous memory systems; heterogeneous system; lightweight runtime routines; loop transformations; polyhedral compiler framework; programming; source-to-source transformation tool; static analyses; Computational modeling; Computer architecture; Distributed databases; Program processors; Receivers; Runtime; Tiles; buffer overflow; memory access monitoring; memory corruption; tagged memory; vliw;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel Architectures and Compilation Techniques (PACT), 2013 22nd International Conference on
  • Conference_Location
    Edinburgh
  • ISSN
    1089-795X
  • Print_ISBN
    978-1-4799-1018-2
  • Type

    conf

  • DOI
    10.1109/PACT.2013.6618833
  • Filename
    6618833