Title :
Starfish: fault-tolerant dynamic MPI programs on clusters of workstations
Author :
Agbaria, Adnan M. ; Friedman, Roy
Author_Institution :
Dept. of Comput. Sci., Technion-Israel Inst. of Technol., Haifa, Israel
Abstract :
This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in being efficient, fault-tolerant, highly available, and dynamic as a system internally, and in supporting fault-tolerance and dynamicity for its application programs as well. Starfish achieves these goals by combining group communication technology with checkpoint/restart, and uses a novel architecture that is both flexible and portable and keeps group communication outside the critical data path, for maximum performance.
Keywords :
message passing; software architecture; software fault tolerance; software portability; system recovery; workstation clusters; Starfish; application programs; checkpoint; critical data path; dynamic MPI programs; fault-tolerant programs; group communication technology; maximum performance; restart; Bandwidth; Communications technology; Computer architecture; Computer networks; Computer science; Concurrent computing; Fault tolerance; Operating systems; Portable computers; Workstations
Conference_Title :
High Performance Distributed Computing, 1999. Proceedings. The Eighth International Symposium on
Conference_Location :
Redondo Beach, CA
Print_ISBN :
0-7803-5681-0
DOI :
10.1109/HPDC.1999.805295