Title :
Comprehensive Resource Use Monitoring for HPC Systems with TACC Stats
Author :
Evans, Todd ; Barth, William L. ; Browne, James C. ; DeLeon, Robert L. ; Furlani, Thomas R. ; Gallo, Steven M. ; Jones, Matthew D. ; Patra, Abani K.
Author_Institution :
Texas Adv. Comput. Center, TX, USA
Abstract :
This paper reports on a comprehensive, fully automated resource use monitoring package, TACC Stats, which enables consultants, users, and other stakeholders in an HPC system to systematically and actively identify jobs/applications that could benefit from expert support, and which aids in the diagnosis of software and hardware issues. TACC Stats continuously collects and analyzes resource usage data for every job run on a system. It differs significantly from conventional profilers because it requires no action on the part of users or consultants: it is always collecting data on every node for every job. TACC Stats is open source and downloadable, configurable and compatible with general Linux-based computing platforms, and extensible to new CPU architectures and hardware devices. It is meant to provide a comprehensive resource usage monitoring solution. In addition to describing TACC Stats, the paper illustrates its application to identifying production jobs that have inefficient resource use characteristics.
Keywords :
parallel processing; resource allocation; software packages; system monitoring; CPU architectures; HPC system; TACC Stats; comprehensive resource usage monitoring solution; expert support; fully automated resource use monitoring package; general Linux-based computing platforms; hardware devices; hardware issues diagnosis; high performance computing; jobs/applications identification; production jobs; profilers; resource usage data analysis; resource usage data collection; software issues diagnosis; Bandwidth; Hardware; Measurement; Monitoring; Radiation detectors; Sockets; Standards
Conference_Title :
HPC User Support Tools (HUST), 2014 First International Workshop on
DOI :
10.1109/HUST.2014.7