Fault recovery for distributed shared memory systems

Author

Dieter, William R. ; Lumpp, James E., Jr.

Author_Institution

Dept. of Electr. Eng., Kentucky Univ., Lexington, KY, USA

Volume

2

fYear

1997

fDate

1-8 Feb 1997

Firstpage

525

Abstract

Distributed Shared Memory (DSM) offers programmers a shared memory abstraction on top of an underlying network of distributed memory machines. Advances in network technology and in the price/performance of workstations suggest that DSM will be the dominant paradigm for future high-performance computing. However, as long-running DSM applications scale to hundreds or even thousands of machines, the probability of a node or network link failing increases. Fault tolerance is typically achieved via “checkpointing” techniques that allow applications to “roll back” to a recent checkpoint rather than restarting. High-performance DSM systems using relaxed memory consistency are significantly more difficult to checkpoint than uniprocessor or message-passing architectures. This paper describes previous approaches to checkpointing message-passing parallel programs along with extensions to DSM systems.

Keywords

distributed memory systems; fault tolerant computing; message passing; probability; shared memory systems; DSM systems; classification; distributed shared memory systems; fault recovery; fault tolerance; high-performance computing; message passing architectures; network technology; price/performance of workstations; Checkpointing; Computer networks; Fault tolerance; High performance computing; Large-scale systems; Message passing; Parallel architectures; Programming profession; Scalability; Workstations

fLanguage

English

Publisher

IEEE

Conference_Titel

Aerospace Conference, 1997. Proceedings., IEEE

Conference_Location

Snowmass at Aspen, CO

Print_ISBN

0-7803-3741-7

Type

conf

DOI

10.1109/AERO.1997.577998

Filename

577998