• DocumentCode
    169128
  • Title
    SLURM Support for Remote GPU Virtualization: Implementation and Performance Study
  • Author
    Iserte, Sergio ; Castello, Adrian ; Mayo, Rafael ; Quintana-Orti, Enrique S. ; Silla, Federico ; Duato, Jose ; Reano, Carlos ; Prades, Javier
  • Author_Institution
    Univ. Jaume I de Castello, Castello de la Plana, Spain
  • fYear
    2014
  • fDate
    22-24 Oct. 2014
  • Firstpage
    318
  • Lastpage
    325
  • Abstract
    SLURM is a resource manager that can be leveraged to share a collection of heterogeneous resources among the jobs in execution in a cluster. However, SLURM is not designed to handle resources such as graphics processing units (GPUs). Concretely, although SLURM can use a generic resource plugin (GRes) to manage GPUs, with this solution the hardware accelerators can only be accessed by the job that is in execution on the node to which the GPU is attached. This is a serious constraint for remote GPU virtualization technologies, which aim at providing user-transparent access to all GPUs in the cluster, independently of the specific location of the node where the application is running with respect to the GPU node. In this work we introduce a new type of device in SLURM, "rgpu", in order to gain access from any application node to any GPU node in the cluster using rCUDA as the remote GPU virtualization solution. With this new scheduling mechanism, a user can access any number of GPUs, as SLURM schedules the tasks taking into account all the graphics accelerators available in the complete cluster. We present experimental results that show the benefits of this new approach in terms of increased flexibility for the job scheduler.
  • Keywords
    graphics processing units; parallel architectures; resource allocation; virtualisation; SLURM; generic resource plugin; graphics accelerator; graphics processing unit; hardware accelerator; rCUDA; remote GPU virtualization; resource manager; scheduling mechanism; user-transparent access; Acceleration; Computer architecture; Graphics processing units; Middleware; Resource management; Throughput; Virtualization; HPC cluster; job scheduler; remote GPU virtualization; resource management;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Architecture and High Performance Computing (SBAC-PAD), 2014 IEEE 26th International Symposium on
  • Conference_Location
    Jussieu
  • ISSN
    1550-6533
  • Type
    conf
  • DOI
    10.1109/SBAC-PAD.2014.49
  • Filename
    6970680