• DocumentCode
    3199064
  • Title

    Identifying the Culprits Behind Network Congestion

  • Author

    Bhatele, Abhinav ; Titus, Andrew R. ; Thiagarajan, Jayaraman J. ; Jain, Nikhil ; Gamblin, Todd ; Bremer, Peer-Timo ; Schulz, Martin ; Kale, Laxmikant V.

  • Author_Institution
    Center for Appl. Sci. Comput., Lawrence Livermore Nat. Lab., Livermore, CA, USA
  • fYear
    2015
  • fDate
    25-29 May 2015
  • Firstpage
    113
  • Lastpage
    122
  • Abstract
    Network congestion is one of the primary causes of performance degradation, performance variability and poor scaling in communication-heavy parallel applications. However, the causes and mechanisms of network congestion on modern interconnection networks are not well understood. We need new approaches to analyze, model and predict this critical behaviour in order to improve the performance of large-scale parallel applications. This paper applies supervised learning algorithms, such as forests of extremely randomized trees and gradient boosted regression trees, to perform regression analysis on communication data and application execution time. Using data derived from multiple executions, we create models to predict the execution time of communication-heavy parallel applications. This analysis also identifies the features and associated hardware components that have the most impact on network congestion and, in turn, on execution time. The ideas presented in this paper have wide applicability: predicting the execution time on a different number of nodes, or different input datasets, or even for an unknown code, identifying the best configuration parameters for an application, and finding the root causes of network congestion on different architectures.
  • Keywords
    learning (artificial intelligence); parallel processing; regression analysis; trees (mathematics); communication-heavy parallel application; extremely randomized tree; gradient boosted regression tree; network congestion; regression analysis; supervised learning algorithm; Data models; Hardware; Prediction algorithms; Predictive models; Regression tree analysis; Three-dimensional displays; Vegetation; congestion; interconnection network; machine learning; modeling; performance prediction; root cause;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International
  • Conference_Location
    Hyderabad
  • ISSN
    1530-2075
  • Type

    conf

  • DOI
    10.1109/IPDPS.2015.92
  • Filename
    7161501
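
The abstract describes regressing execution time on communication data with tree ensembles and then ranking features by their impact. The following is a minimal sketch of that general idea, not the authors' implementation: it is a simplified pure-Python forest of extremely randomized regression trees (random cut points, no bootstrap) with importances taken as normalized squared-error reduction. The feature names (`link_bytes`, `stalls`, `unrelated`), the synthetic data, and all function names are hypothetical and chosen only for illustration.

```python
import random
from statistics import fmean


def sse(ys):
    """Sum of squared errors of ys around their mean."""
    m = fmean(ys)
    return sum((v - m) ** 2 for v in ys)


def build_tree(X, y, rng, importances, k=2, n_min=5):
    """Grow one extremely randomized regression tree.

    At each node, k randomly chosen features each get one uniformly random
    cut point; the candidate with the largest squared-error reduction wins.
    """
    if len(y) < n_min or min(y) == max(y):
        return fmean(y)  # leaf: mean execution time of the runs that reach it
    n_features = len(X[0])
    best = None
    for f in rng.sample(range(n_features), min(k, n_features)):
        col = [row[f] for row in X]
        lo, hi = min(col), max(col)
        if lo == hi:
            continue  # constant feature, cannot split on it
        cut = rng.uniform(lo, hi)
        left = [i for i, v in enumerate(col) if v < cut]
        right = [i for i, v in enumerate(col) if v >= cut]
        if not left or not right:
            continue
        gain = sse(y) - sse([y[i] for i in left]) - sse([y[i] for i in right])
        if best is None or gain > best[0]:
            best = (gain, f, cut, left, right)
    if best is None:
        return fmean(y)
    gain, f, cut, left, right = best
    importances[f] += gain  # credit the error reduction to the split feature
    return (
        f,
        cut,
        build_tree([X[i] for i in left], [y[i] for i in left], rng, importances, k, n_min),
        build_tree([X[i] for i in right], [y[i] for i in right], rng, importances, k, n_min),
    )


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, cut, left, right = node
        node = left if row[f] < cut else right
    return node


def fit_forest(X, y, n_trees=30, seed=0):
    """Fit a forest; every tree sees all runs (no bootstrap sampling)."""
    rng = random.Random(seed)
    importances = [0.0] * len(X[0])
    trees = [build_tree(X, y, rng, importances) for _ in range(n_trees)]
    total = sum(importances) or 1.0
    return trees, [v / total for v in importances]


def predict(trees, row):
    """Average the per-tree predictions."""
    return fmean([predict_tree(t, row) for t in trees])


def make_synthetic_runs(n=200, seed=1):
    """Hypothetical data: execution time driven mostly by the 'stalls' feature."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        link_bytes, stalls, unrelated = rng.random(), rng.random(), rng.random()
        X.append([link_bytes, stalls, unrelated])
        y.append(1.0 + 0.5 * link_bytes + 3.0 * stalls + rng.gauss(0, 0.05))
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic_runs()
    trees, importances = fit_forest(X, y)
    for name, score in zip(["link_bytes", "stalls", "unrelated"], importances):
        print(f"{name}: {score:.3f}")
    print("predicted time:", predict(trees, [0.5, 0.9, 0.5]))
```

On this synthetic data the `stalls` feature should receive the largest importance, since it was constructed to dominate the execution time. In practice, a maintained library implementation of extremely randomized trees or gradient boosted regression trees would likely be preferable to this hand-rolled sketch.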