• DocumentCode
    3585221
  • Title

    Comprehensive Resource Use Monitoring for HPC Systems with TACC Stats

  • Author

    Evans, Todd ; Barth, William L. ; Browne, James C. ; DeLeon, Robert L. ; Furlani, Thomas R. ; Gallo, Steven M. ; Jones, Matthew D. ; Patra, Abani K.

  • Author_Institution
    Texas Adv. Comput. Center, TX, USA
  • fYear
    2014
  • Firstpage
    13
  • Lastpage
    21
  • Abstract
    This paper reports on a comprehensive, fully automated resource use monitoring package, TACC Stats, which enables consultants, users, and other stakeholders in an HPC system to systematically and actively identify jobs/applications that could benefit from expert support, and which aids in the diagnosis of software and hardware issues. TACC Stats continuously collects and analyzes resource usage data for every job run on a system. It differs significantly from conventional profilers because it requires no action on the part of users or consultants: it is always collecting data on every node for every job. TACC Stats is open source and downloadable, configurable and compatible with general Linux-based computing platforms, and extensible to new CPU architectures and hardware devices. It is meant to provide a comprehensive resource usage monitoring solution. In addition to describing TACC Stats, the paper illustrates its application to identifying production jobs that have inefficient resource use characteristics.
  • Keywords
    parallel processing; resource allocation; software packages; system monitoring; CPU architectures; HPC system; TACC Stats; comprehensive resource usage monitoring solution; expert support; fully automated resource use monitoring package; general Linux-based computing platforms; hardware devices; hardware issues diagnosis; high performance computing; jobs/applications identification; production jobs; profilers; resource usage data analysis; resource usage data collection; software issues diagnosis; Bandwidth; Hardware; Measurement; Monitoring; Radiation detectors; Sockets; Standards;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    HPC User Support Tools (HUST), 2014 First International Workshop on
  • Type

    conf

  • DOI
    10.1109/HUST.2014.7
  • Filename
    7081222