• DocumentCode
    692924
  • Title
    Enabling comprehensive data-driven system management for large computational facilities
  • Author
    Browne, James C. ; DeLeon, Robert L. ; Lu, Charng-Da ; Jones, M.D. ; Gallo, Steven M. ; Ghadersohi, Amin ; Patra, Abani K. ; Barth, William L. ; Hammond, John ; Furlani, Thomas R. ; McLay, Robert T.
  • Author_Institution
    Center for Computational Research, SUNY at Buffalo, Buffalo, NY, USA
  • fYear
    2013
  • fDate
    17-22 Nov. 2013
  • Firstpage
    1
  • Lastpage
    11
  • Abstract
    This paper presents a tool chain, based on the open-source tool TACC_Stats, for systematic and comprehensive job-level resource use measurement for large cluster computers, and its incorporation into XDMoD, a reporting and analytics framework for resource management that targets meeting the information needs of users, application developers, systems administrators, systems management and funding managers. Accounting, scheduler and event logs are integrated with system performance data from TACC_Stats. TACC_Stats periodically records resource use, including many hardware counters, for each job running on each node. Furthermore, system-level metrics are obtained through aggregation of the node (job) level data. Analysis of this data generates many types of standard and custom reports and even a limited predictive capability that has not previously been available for open-source, Linux-based software systems. This paper presents case studies of information that can be applied for effective resource management. We believe this system to be the first fully comprehensive system for supporting the information needs of all stakeholders in HPC systems based on open-source software.
  • Keywords
    Linux; public domain software; resource allocation; workstation clusters; Linux-based software systems; TACC_Stats; XDMoD; cluster computers; computational facilities; data-driven system management; funding managers; job level resource; open source tool; open-source software based HPC systems; resource management; system administrators; system level metrics; system management; systematic resource; tool chain; Abstracts; Bandwidth; Market research; Performance evaluation; Servers; Sockets; Standards
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • Conference_Location
    Denver, CO
  • Print_ISBN
    978-1-4503-2378-9
  • Type
    conf
  • DOI
    10.1145/2503210.2503230
  • Filename
    6877519