  • DocumentCode
    3643422
  • Title
    Communication Optimizations for Distributed-Memory X10 Programs
  • Author
    Rajkishore Barik;Jisheng Zhao;David Grove;Igor Peshansky;Zoran Budimlic;Vivek Sarkar
  • Author_Institution
    Intel Corp., Santa Clara, CA, USA
  • fYear
    2011
  • fDate
    5/1/2011 12:00:00 AM
  • Firstpage
    1101
  • Lastpage
    1113
  • Abstract
    X10 is a new object-oriented PGAS (Partitioned Global Address Space) programming language with support for distributed asynchronous dynamic parallelism that goes beyond past SPMD message-passing models such as MPI and SPMD PGAS models such as UPC and Co-Array Fortran. The concurrency constructs in X10 make it possible to express complex computation and communication structures with higher productivity than other distributed-memory programming models. However, this productivity often comes at the cost of high performance overhead when the language is used in its full generality. This paper introduces high-level compiler optimizations and transformations to reduce communication and synchronization overheads in distributed-memory implementations of X10 programs. Specifically, we focus on locality optimizations such as scalar replacement and task localization, combined with supporting transformations such as loop distribution, scalar expansion, loop tiling, and loop splitting. We have completed a prototype implementation of these high-level optimizations, and performed a performance evaluation that shows significant improvements in performance, scalability, communication volume and number of tasks. We evaluated the communication optimizations on three platforms: a 128-node Blue Gene/P cluster, a 32-node Nehalem cluster, and a 16-node Power7 cluster. On the Blue Gene/P cluster, we observed a maximum performance improvement of 31.46x relative to the unoptimized case (for the MolDyn benchmark). On the Nehalem cluster, we observed a maximum performance improvement of 3.01x (for the NQueens benchmark) and on the Power7 cluster, we observed a maximum performance improvement of 2.73x (for the MolDyn benchmark). In addition, there was no case in which the optimized code was slower than the unoptimized case. 
    We also believe that the optimizations presented in this paper will be necessary for any high-productivity PGAS language that is based on modern object-oriented principles and designed for execution on future Extreme Scale systems that place a high premium on locality improvement for performance and energy efficiency.
  • Keywords
    "Optimization","Synchronization","Reactive power","Arrays","Object oriented modeling","Benchmark testing","Electronics packaging"
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-61284-372-8
  • Type
    conf
  • DOI
    10.1109/IPDPS.2011.105
  • Filename
    6012917