Title :
Using logging and asynchronous checkpointing to implement recoverable distributed shared memory
Author :
Richard, Golden G., III; Singhal, Mukesh
Author_Institution :
Dept. of Comput. & Inf. Sci., Ohio State Univ., Columbus, OH, USA
Abstract :
Distributed shared memory provides a useful paradigm for developing distributed applications. As the number of processors in a system and the running time of distributed applications increase, so does the likelihood of processor failure. A method for recovering processes in a distributed shared memory environment that minimizes both lost work and the cost of recovery is therefore desirable, so that long-running applications are not adversely affected by processor failure. This paper presents a technique for achieving recoverable distributed shared memory that combines asynchronous process checkpoints with logging of the pages accessed by read operations on the shared address space. The scheme supports independent process recovery: a failed process recovers without forcing operational processes to roll back. The method is particularly useful in environments where taking process checkpoints is expensive.
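The abstract does not spell out the protocol, so the following is only a minimal Python sketch of the general idea of pairing an independent checkpoint with a log of read pages; it is not the authors' algorithm. It assumes that each process is deterministic apart from the contents of the shared pages it reads, that the checkpoint and read log are kept on stable storage, and that shared memory can be modeled as a dict mapping page ids to values. The class `RecoverableProcess` and its methods are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical model: shared memory maps a page id to that page's contents.
SharedMemory = Dict[int, int]


class RecoverableProcess:
    """Sketch of one process using an asynchronous checkpoint plus a read-page log."""

    def __init__(self, program: List[int]) -> None:
        self.program = program  # assumed deterministic: the page ids read, in order
        self.pc = 0             # volatile state: index of the next read
        self.acc = 0            # volatile state: running sum of the pages read
        # Assumed to live on stable storage and survive a crash.
        self.checkpoint: Tuple[int, int] = (0, 0)
        self.read_log: List[Tuple[int, int]] = []  # (page id, contents) since checkpoint

    def take_checkpoint(self) -> None:
        # Taken independently: no coordination with any other process.
        self.checkpoint = (self.pc, self.acc)
        # Reads before the checkpoint are never replayed, so their log entries are dropped.
        self.read_log.clear()

    def step(self, shared: SharedMemory) -> None:
        page = self.program[self.pc]
        value = shared[page]
        # Log the contents actually returned by the read on the shared address space.
        self.read_log.append((page, value))
        self.acc += value
        self.pc += 1

    def recover(self) -> None:
        # Restore the checkpoint, then re-execute using logged page contents instead of
        # live shared memory. Later writes by other processes cannot change the replay,
        # so none of those operational processes has to roll back.
        self.pc, self.acc = self.checkpoint
        for page, value in self.read_log:
            if self.program[self.pc] != page:
                raise RuntimeError("replay diverged from the logged execution")
            self.acc += value
            self.pc += 1


if __name__ == "__main__":
    shared: SharedMemory = {0: 5, 1: 7, 2: 11}
    p = RecoverableProcess([0, 1, 2, 0])
    p.step(shared)          # reads page 0
    p.take_checkpoint()     # checkpoint (pc=1, acc=5)
    p.step(shared)          # reads page 1 while it still holds 7
    shared[1] = 100         # another process later overwrites page 1
    p.step(shared)          # reads page 2
    before_crash = (p.pc, p.acc)
    p.pc, p.acc = -1, -1    # simulate losing volatile state in a crash
    p.recover()
    print(before_crash, (p.pc, p.acc))
```

In this sketch the recovered state matches the state before the crash even though page 1 changed in the meantime; replaying against live shared memory would instead read 100 and diverge. Logging reads rather than coordinating checkpoints is what lets the failed process recover on its own.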
Keywords :
distributed memory systems; fault tolerant computing; reliability; shared memory systems; system recovery; asynchronous checkpointing; asynchronous process checkpoints; distributed applications; distributed shared memory environment; independent process recovery; long-running applications; operational processes; process checkpoints; processor failure; read operations; recoverable distributed shared memory; running time; shared address space; Application software; Checkpointing; Computer crashes; Costs; Data structures; Distributed computing; Information science; Load management; Software systems; Trademarks
Conference_Title :
Proceedings of the 12th Symposium on Reliable Distributed Systems, 1993
Conference_Location :
Princeton, NJ
Print_ISBN :
0-8186-4310-2
DOI :
10.1109/RELDIS.1993.393473