• DocumentCode
    796979
  • Title

    Lightweight Online Performance Monitoring and Tuning with Embedded Gossip

  • Author

    Zhu, J.W. ; Bridges, P.G. ; Maccabe, A.

  • Author_Institution
    Xilinx, Inc., Albuquerque, NM
  • Volume
    20
  • Issue
    7
  • fYear
    2009
  • fDate
    7/1/2009 12:00:00 AM
  • Firstpage
    1038
  • Lastpage
    1049
  • Abstract
    Understanding and tuning the performance of large-scale, long-running applications is difficult, with both standard trace-based and statistical methods having substantial shortcomings that limit their usefulness. This paper describes a new performance monitoring approach called Embedded Gossip (EG) designed to enable lightweight online performance monitoring and tuning. EG works by piggybacking performance information on existing messages and performing information correlation online, giving each process in a parallel application a weakly consistent global view of the behavior of the entire application. To demonstrate the viability of EG, this paper presents the design and experimental evaluation of two different online monitoring systems and an online global adaptation system driven by EG. In addition, we present a metric system for evaluating the suitability of an application to EG-based monitoring and adaptation, a general architecture for implementing EG-based monitoring systems, and a modified global commit algorithm appropriate for use in EG-based global adaptation systems. Together, these results demonstrate that EG is an efficient, low-overhead approach for addressing a wide range of parallel performance monitoring tasks and that results from these systems can effectively drive online global adaptation.
  • Keywords
    software performance evaluation; system monitoring; EG-based monitoring systems; embedded gossip; global commit algorithm; information correlation; large-scale long-running application; online global adaptation system; online performance monitoring; online performance tuning; standard trace-based method; statistical method; Bridges; Debugging; Information analysis; Large-scale systems; Merging; Monitoring; Performance analysis; Protocols; Statistical analysis; System performance; Lightweight performance monitoring; Measurements; Parallel systems; Support for Adaptation; dynamic performance tuning
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type

    jour

  • DOI
    10.1109/TPDS.2008.126
  • Filename
    4564447