• DocumentCode
    802200
  • Title

    STORM: Scalable Resource Management for Large-Scale Parallel Computers

  • Author

    Frachtenberg, Eitan ; Petrini, Fabrizio ; Fernández, Juan ; Pakin, Scott

  • Author_Institution
    CCS-3 Modeling, Algorithms, & Informatics Group, Los Alamos National Laboratory, NM
  • Volume
    55
  • Issue
    12
  • fYear
    2006
  • Firstpage
    1572
  • Lastpage
    1587
  • Abstract
    Although clusters are a popular form of high-performance computing, they remain more difficult to manage than sequential systems, or even symmetric multiprocessors. In this paper, we identify a small set of primitive mechanisms that are sufficiently general to be used as building blocks to solve a variety of resource-management problems. We then present STORM, a resource-management environment that embodies these mechanisms in a scalable, low-overhead, and efficient implementation. The key innovation behind STORM is a modular software architecture that reduces all resource-management functionality to a small number of highly scalable mechanisms. These mechanisms simplify the integration of resource management with low-level network features. As a result of this design, STORM can launch large, parallel applications an order of magnitude faster than the best time reported in the literature and can gang-schedule a parallel application as fast as the node OS can schedule a sequential application. This paper describes the mechanisms and algorithms behind STORM and presents a detailed performance model that shows that STORM's performance can scale to thousands of nodes.
  • Keywords
    computer network management; network operating systems; parallel machines; processor scheduling; resource allocation; software architecture; workstation clusters; cluster computing; high-performance computing; large-scale parallel computers; modular software architecture; network operating system; node OS; parallel application gang-scheduling; performance model; scalable resource management environment; sequential application scheduling; sequential system management; symmetric multiprocessor management; Algorithm design and analysis; Application software; Concurrent computing; Laboratories; Large-scale systems; Multicast algorithms; Resource management; Scheduling algorithm; Storms; Technological innovation; Hardware/software interface; supercomputers; system architectures, integration, and modeling
  • fLanguage
    English
  • Journal_Title
    Computers, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9340
  • Type

    jour

  • DOI
    10.1109/TC.2006.206
  • Filename
    1717389