Title :
Low overhead high performance runtime monitoring of collective communication
Author :
Bongo, Lars Ailo ; Anshus, Otto J. ; Bjoerndalen, J.M.
Author_Institution :
Dept. of Comput. Sci., Tromso Univ., Norway
Abstract :
Scalability of parallel applications on clusters and multi-clusters is often limited by communication performance. Message tracing can provide data for understanding bottlenecks, and for performance tuning. However, it requires collecting, storing, analyzing, and transferring potentially gigabytes of data. We have designed the EventSpace system for low overhead and high performance runtime collective communication trace analysis. EventSpace separates the perturbation and performance requirements of data collection, analysis, gathering, and visualization. Data collection overhead is low since the minimum amount of data is recorded and stored temporarily in main memory. The recorded data is either discarded or analyzed on demand using available cluster resources. Analysis is distributed for high performance, and coscheduled with the computation and communication system threads for low perturbation. Gathering of analyzed data is done using extensible collective communication operations, which can be tuned to trade off between performance and monitoring overhead. EventSpace was used to do run-time monitoring and analysis of collective communication micro-benchmarks run on clusters, multi-clusters, and multi-clusters with emulated WAN links. Performance data was collected, analyzed and gathered with 0-3% monitoring overhead.
Keywords :
computer networks; message passing; parallel programming; system monitoring; EventSpace system; collective communication trace analysis; message tracing; run-time monitoring; Data analysis; Data visualization; Distributed computing; High performance computing; Monitoring; Performance analysis; Runtime; Scalability; Wide area networks; Yarn;
Conference_Title :
Parallel Processing, 2005. ICPP 2005. International Conference on
Print_ISBN :
0-7695-2380-3
DOI :
10.1109/ICPP.2005.50