DocumentCode :
308580
Title :
Fault recovery for distributed shared memory systems
Author :
Dieter, William R. ; Lumpp, James E., Jr.
Author_Institution :
Dept. of Electr. Eng., Kentucky Univ., Lexington, KY, USA
Volume :
2
fYear :
1997
fDate :
1-8 Feb 1997
Firstpage :
525
Abstract :
Distributed Shared Memory (DSM) offers programmers a shared memory abstraction on top of an underlying network of distributed memory machines. Advances in network technology and price/performance of workstations suggest that DSM will be the dominant paradigm for future high-performance computing. However, as long-running DSM applications scale to hundreds or even thousands of machines, the probability of a node or network link failing increases. Fault tolerance is typically achieved via “checkpointing” techniques that allow applications to “roll back” to a recent checkpoint rather than restarting. High-performance DSM systems using relaxed memory consistency are significantly more difficult to checkpoint than uniprocessor or message-passing architectures. This paper describes previous approaches to checkpointing message-passing parallel programs along with extensions to DSM systems.
Keywords :
distributed memory systems; fault tolerant computing; message passing; probability; shared memory systems; DSM systems; classification; distributed shared memory systems; fault recovery; fault tolerance; high-performance computing; message passing architectures; network technology; price/performance of workstations; Checkpointing; Computer networks; Fault tolerance; High performance computing; Large-scale systems; Message passing; Parallel architectures; Programming profession; Scalability; Workstations
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Aerospace Conference, 1997. Proceedings., IEEE
Conference_Location :
Snowmass at Aspen, CO
Print_ISBN :
0-7803-3741-7
Type :
conf
DOI :
10.1109/AERO.1997.577998
Filename :
577998