Author_Institution :
Renaissance Computing Institute, University of North Carolina, Chapel Hill, NC, USA
Abstract :
Summary form only given. Systems built from commodity processors dominate high-performance computing, with systems containing thousands of processors now being deployed. Node counts for multi-teraflop systems are growing to tens of thousands, proposed petaflop systems are likely to contain hundreds of thousands of nodes, and a tsunami of new experimental and computational data must be processed. Although the mean time before failure (MTBF) for individual components (i.e., processors, disks, memories, power supplies, fans and networks) is high, at such component counts the aggregate rate of failures becomes significant. In contrast to parallel systems, distributed software for networks, whether transport protocols or Web/Grid services, is designed to be resilient to component failures. Our thesis is that these "two worlds" of software, distributed systems and parallel systems, must converge. In this paper, we describe possible approaches for the design and effective use of large-scale systems. The approaches range from intelligent hardware monitoring and adaptation, through low-overhead recovery schemes, statistical sampling and differential scheduling, to alternative models of system software, including evolutionary adaptation.
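A rough illustration of the scale argument sketched in the abstract (not taken from the paper itself): if component failures are assumed independent and exponentially distributed, the aggregate system MTBF is approximately the per-component MTBF divided by the number of components. The node counts and the 5-year per-node MTBF below are hypothetical figures chosen only to show the effect.

```python
# Illustrative sketch (assumptions, not data from the paper): why high
# per-component MTBF does not imply high system MTBF at scale.
# Assumption: independent, exponentially distributed component failures,
# so failure rates add and system MTBF ~= component MTBF / N.

HOURS_PER_YEAR = 8760


def system_mtbf_hours(component_mtbf_years: float, num_components: int) -> float:
    """Aggregate MTBF (hours) of N independent components."""
    return component_mtbf_years * HOURS_PER_YEAR / num_components


if __name__ == "__main__":
    # Hypothetical 5-year per-node MTBF at three machine sizes.
    for nodes in (1_000, 10_000, 100_000):
        mtbf = system_mtbf_hours(component_mtbf_years=5, num_components=nodes)
        print(f"{nodes:>7} nodes -> system MTBF ~ {mtbf:.2f} hours")
    # At 100,000 nodes the aggregate MTBF is roughly 0.44 hours (about
    # 26 minutes), which is why resilience, not component reliability
    # alone, dominates at the scales discussed in the abstract.
```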
Keywords :
Internet; grid computing; large-scale systems; parallel processing; processor scheduling; sampling methods; transport protocols; MTBF; Web/Grid services; differential scheduling; distributed software; evolutionary adaptation; high-performance computing; intelligent hardware monitoring; large-scale system; low-overhead recovery scheme; mean time before failure; multi-teraflop system; parallel system; commodity processors; statistical sampling; transport protocol; Condition monitoring; Fans; Hardware; Large-scale systems; Power supplies; Sampling methods; Software systems; System software; Transport protocols; Tsunami
Conference_Title :
13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2005