• DocumentCode
    3324331
  • Title

    Starfish: fault-tolerant dynamic MPI programs on clusters of workstations

  • Author

    Agbaria, Adnan M. ; Friedman, Roy

  • Author_Institution
    Dept. of Comput. Sci., Technion-Israel Inst. of Technol., Haifa, Israel
  • fYear
    1999
  • fDate
    1999
  • Firstpage
    167
  • Lastpage
    176
  • Abstract
    This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in being efficient, fault-tolerant, highly available, and dynamic as a system internally, and in supporting fault-tolerance and dynamicity for its application programs as well. Starfish achieves these goals by combining group communication technology with checkpoint/restart, and uses a novel architecture that is both flexible and portable and keeps group communication outside the critical data path, for maximum performance.
  • Keywords
    message passing; software architecture; software fault tolerance; software portability; system recovery; workstation clusters; Starfish; application programs; checkpoint; critical data path; dynamic MPI programs; fault-tolerant programs; group communication technology; maximum performance; restart; Bandwidth; Communications technology; Computer architecture; Computer networks; Computer science; Concurrent computing; Fault tolerance; Operating systems; Portable computers; Workstations
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Distributed Computing, 1999. Proceedings. The Eighth International Symposium on
  • Conference_Location
    Redondo Beach, CA
  • ISSN
    1082-8907
  • Print_ISBN
    0-7803-5681-0
  • Type

    conf

  • DOI
    10.1109/HPDC.1999.805295
  • Filename
    805295