DocumentCode :
2166504
Title :
Raptor: integrating checkpoints and thread migration for cluster management
Author :
Shafi, Hazim ; Speight, Evan ; Bennett, John K.
Author_Institution :
Austin Res. Lab., IBM Res., Austin, TX, USA
fYear :
2003
fDate :
6-8 Oct. 2003
Firstpage :
141
Lastpage :
150
Abstract :
Software distributed shared memory (SDSM) provides the abstraction necessary to run shared-memory applications on cost-effective parallel platforms such as clusters of workstations. However, problems such as cluster component reliability and cluster management, which are not directly related to performance, need to be addressed before SDSM solutions can be widely adopted. This paper presents Raptor, an SDSM cluster management system based on checkpoint/recovery and thread migration. Raptor checkpoints decouple the runtime system and application data from application threads, allowing efficient load balancing, resource allocation, and rollback recovery. The system has two important features. First, it reduces checkpoint overhead by saving only application-specific data that cannot be recreated at recovery time. Second, by integrating thread migration capability both at run time and at recovery, it allows computing resources to be added to or removed from a running application, while placing little or no additional burden on the SDSM application programmer.
Keywords :
distributed programming; distributed shared memory systems; resource allocation; system recovery; workstation clusters; Raptor; SDSM; checkpoint integration; cluster component reliability; cluster management; load balancing; rollback recovery; shared-memory applications; software distributed shared-memory; thread migration; workstations clusters; Coherence; Packaging; Personal communication networks; Programming environments; Programming profession; Registers; Resource management; Waste management; Windows; Yarn
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Reliable Distributed Systems, 2003. Proceedings. 22nd International Symposium on
ISSN :
1060-9857
Print_ISBN :
0-7695-1955-5
Type :
conf
DOI :
10.1109/RELDIS.2003.1238063
Filename :
1238063