• DocumentCode
    2535407
  • Title
    Identifying the Root Causes of Wait States in Large-Scale Parallel Applications

  • Author
    Böhme, David ; Geimer, Markus ; Wolf, Felix ; Arnold, Lukas
  • Author_Institution
    Jülich Supercomputing Centre, Jülich, Germany
  • fYear
    2010
  • fDate
    13-16 Sept. 2010
  • Firstpage
    90
  • Lastpage
    100
  • Abstract
    Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers is increasing from generation to generation. However, load or communication imbalance prevents many codes from taking advantage of the available parallelism, as delays of single processes may spread wait states across the entire machine. Moreover, when employing complex point-to-point communication patterns, wait states may propagate along far-reaching cause-effect chains that are hard to track manually and that complicate an assessment of the actual costs of an imbalance. Building on earlier work by Meira Jr. et al., we present a scalable approach that identifies program wait states and attributes their costs in terms of resource waste to their original cause. By replaying event traces in parallel both in forward and backward direction, we can identify the processes and call paths responsible for the most severe imbalances even for runs with tens of thousands of processes.
  • Keywords
    integrated circuit design; microprocessor chips; parallel machines; cause-effect chains; large-scale parallel applications; microprocessor design; parallel machine; point-to-point communication patterns; supercomputers; Clocks; Context; Delay; Optimization; Runtime; Scalability; Synchronization; parallel program performance analysis; root cause analysis
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel Processing (ICPP), 2010 39th International Conference on
  • Conference_Location
    San Diego, CA
  • ISSN
    0190-3918
  • Print_ISBN
    978-1-4244-7913-9
  • Electronic_ISBN
    0190-3918
  • Type
    conf
  • DOI
    10.1109/ICPP.2010.18
  • Filename
    5599153