Title :
STORM: Scalable Resource Management for Large-Scale Parallel Computers
Author :
Frachtenberg, Eitan ; Petrini, Fabrizio ; Fernández, Juan ; Pakin, Scott
Author_Institution :
CCS-3 Modeling, Algorithms, & Informatics Group, Los Alamos National Laboratory, NM
Abstract :
Although clusters are a popular form of high-performance computing, they remain more difficult to manage than sequential systems, or even symmetric multiprocessors. In this paper, we identify a small set of primitive mechanisms that are sufficiently general to be used as building blocks to solve a variety of resource-management problems. We then present STORM, a resource-management environment that embodies these mechanisms in a scalable, low-overhead, and efficient implementation. The key innovation behind STORM is a modular software architecture that reduces all resource-management functionality to a small number of highly scalable mechanisms. These mechanisms simplify the integration of resource management with low-level network features. As a result of this design, STORM can launch large, parallel applications an order of magnitude faster than the best time reported in the literature and can gang-schedule a parallel application as fast as the node OS can schedule a sequential application. This paper describes the mechanisms and algorithms behind STORM and presents a detailed performance model that shows that STORM's performance can scale to thousands of nodes.
Keywords :
computer network management; network operating systems; parallel machines; processor scheduling; resource allocation; software architecture; workstation clusters; cluster computing; high-performance computing; large-scale parallel computers; modular software architecture; network operating system; node OS; parallel application gang-scheduling; performance model; scalable resource management environment; sequential application scheduling; sequential system management; symmetric multiprocessor management; Algorithm design and analysis; Application software; Concurrent computing; Laboratories; Large-scale systems; Multicast algorithms; Resource management; Scheduling algorithm; Storms; Technological innovation; Hardware/software interface; supercomputers; system architectures, integration, and modeling
Journal_Title :
Computers, IEEE Transactions on
DOI :
10.1109/TC.2006.206