  • DocumentCode
    1665761
  • Title
    Clusters: challenges and opportunities
  • Author
    Reed, Daniel A.
  • Author_Institution
    National Center for Supercomputing Applications, University of Illinois, Urbana, IL, USA
  • fYear
    2003
  • Abstract
    Summary form only given. The continuum of cluster computing continues to expand, with terascale clusters now in production and petascale clusters in design. How do we manage clusters with tens of thousands of nodes, each with power, communication, processing, and memory constraints? How do we design, package, and support systems with hundreds of thousands of processors reliably? This paper discusses approaches to computational and communication fault tolerance and reliability for large-scale clusters. It also sketches some of the technical challenges and opportunities in deploying and supporting large-scale clusters, highlighted by recent developments at NCSA and on the U.S. TeraGrid and their use in emerging scientific applications.
  • Keywords
    performance evaluation; workstation clusters; cluster computing; communication fault tolerance; petascale clusters; reliability; terascale clusters; Energy management; Fault tolerance; Large-scale systems; Memory management; Packaging; Petascale computing; Power system management; Power system reliability; Production; USA Councils;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Processing Symposium, 2003. Proceedings. International
  • ISSN
    1530-2075
  • Print_ISBN
    0-7695-1926-1
  • Type
    conf
  • DOI
    10.1109/IPDPS.2003.1213359
  • Filename
    1213359