• DocumentCode
    2862847
  • Title
    Efficient tracing and performance analysis for large distributed systems
  • Author
    Anderson, Eric; Hoover, Christopher; Li, Xiaozhou; Tucek, Joseph
  • Author_Institution
    Hewlett-Packard Labs., Palo Alto, CA, USA
  • fYear
    2009
  • fDate
    21-23 Sept. 2009
  • Firstpage
    1
  • Lastpage
    10
  • Abstract
    Distributed systems are notoriously difficult to implement and debug. One important tool for understanding the behavior of distributed systems is tracing. Unfortunately, effective tracing for modern distributed systems faces several challenges. First, many interesting behaviors in distributed systems occur only rarely, or only at full production scale. Hence we need tracing mechanisms that impose minimal overhead, in order to allow always-on tracing of production instances. Second, for high-speed systems, messages can be delivered in significantly less time than the error of traditional time synchronization techniques such as the Network Time Protocol (NTP), necessitating time adjustment techniques with much higher precision. Third, distributed systems today may generate millions of events per second systemwide, resulting in traces consisting of billions of events. Such large traces can overwhelm existing trace analysis tools. We present techniques that address these three challenges. Our contributions include (1) a low-overhead tracing mechanism, which allows tracing of large systems without impacting their behavior or performance (0.14 μs/event), (2) a post hoc technique for producing highly accurate time synchronization across hosts (within 10 μs, compared to between 100 μs and 2 ms for NTP), and (3) incremental data processing techniques that facilitate analyzing traces containing billions of trace points on desktop systems. We have successfully applied these techniques to two distributed systems, a cooperative caching system and a distributed storage system, and from our experience, we believe our techniques are applicable to other distributed systems.
  • Keywords
    distributed processing; program debugging; program diagnostics; synchronisation; cooperative caching system; desktop systems; distributed storage system; incremental data processing techniques; large distributed systems; low-overhead tracing mechanism; network time protocol; post hoc technique; time adjustment techniques; time synchronization; trace analysis tools; Clocks; Cooperative caching; Debugging; Distortion measurement; Laboratories; Monitoring; Performance analysis; Production systems; Programming profession; Synchronization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Modeling, Analysis & Simulation of Computer and Telecommunication Systems, 2009. MASCOTS '09. IEEE International Symposium on
  • Conference_Location
    London
  • ISSN
    1526-7539
  • Print_ISBN
    978-1-4244-4927-9
  • Electronic_ISBN
    1526-7539
  • Type
    conf
  • DOI
    10.1109/MASCOT.2009.5366158
  • Filename
    5366158