Title :
Tolerating node failures in cache only memory architectures
Author :
Gefflaut, A. ; Morin, C. ; Banâtre, M.
Author_Institution :
IRISA, Rennes, France
Abstract :
COMAs (cache only memory architectures) are an interesting class of large scale shared memory multiprocessors. They extend the concepts of cache memories and shared virtual memory by using the local memories of the nodes as large caches for a single shared address space. Because of their large number of components, these architectures are particularly susceptible to hardware failures, so fault tolerance mechanisms have to be introduced to ensure high availability. We propose an implementation of backward error recovery in a COMA which minimizes performance degradation and requires few hardware modifications. This implementation uses the features of a COMA to implement a stable storage abstraction using the standard memories of the architecture. Recovery data are replicated and mixed with current data in node memories, and both kinds of data are managed transparently by an extended coherence protocol.
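The abstract gives no protocol details, so the following is only a toy illustration of the general idea it describes: recovery data kept in ordinary node memories alongside current data, replicated so that a single node failure can be rolled back. It is not the authors' protocol. Every name here (`Node`, `ToyComa`, `checkpoint`, `rollback`), the choice of exactly two recovery copies per item, and the round-robin placement are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node: one local memory holding both kinds of data."""
    current: dict = field(default_factory=dict)   # item -> working value
    recovery: dict = field(default_factory=dict)  # item -> checkpointed value
    alive: bool = True


class ToyComa:
    """Toy model only; real COMA coherence protocols are far richer."""

    def __init__(self, n_nodes):
        if n_nodes < 2:
            raise ValueError("need at least two nodes to replicate recovery data")
        self.nodes = [Node() for _ in range(n_nodes)]

    def write(self, node_id, item, value):
        # Simplified exclusive write: drop every other current copy,
        # then install the new value in the writer's local memory.
        for n in self.nodes:
            n.current.pop(item, None)
        self.nodes[node_id].current[item] = value

    def read(self, item):
        # Return the current copy held by any live node.
        for n in self.nodes:
            if n.alive and item in n.current:
                return n.current[item]
        raise KeyError(item)

    def checkpoint(self):
        # Establish a recovery point: every current item gets two
        # recovery copies placed on two distinct live nodes (assumed policy).
        live = [i for i, n in enumerate(self.nodes) if n.alive]
        if len(live) < 2:
            raise RuntimeError("cannot replicate recovery data on one node")
        snapshot = {}
        for i in live:
            snapshot.update(self.nodes[i].current)
        for n in self.nodes:
            n.recovery.clear()
        for k, (item, value) in enumerate(sorted(snapshot.items())):
            a = live[k % len(live)]
            b = live[(k + 1) % len(live)]
            self.nodes[a].recovery[item] = value
            self.nodes[b].recovery[item] = value

    def fail(self, node_id):
        # A node failure loses everything in that node's memory.
        n = self.nodes[node_id]
        n.alive = False
        n.current.clear()
        n.recovery.clear()

    def rollback(self):
        # Backward error recovery: discard all current data, restore the
        # last recovery point from surviving copies, then re-replicate.
        restored = {}
        for n in self.nodes:
            if n.alive:
                restored.update(n.recovery)
                n.current.clear()
        first_live = next(n for n in self.nodes if n.alive)
        first_live.current.update(restored)
        self.checkpoint()
```

In this sketch, a three-node system that checkpoints, performs a further write, and then loses one node can call `rollback()` to return to the checkpointed values, because each recovery item survives on at least one of the two remaining nodes.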
Keywords :
cache storage; fault tolerant computing; memory architecture; shared memory systems; system recovery; COMA; backward error recovery; cache memories; cache only memory architectures; extended coherence protocol; fault tolerance mechanisms; hardware failures; large scale shared memory multiprocessors; local memories; node failure tolerance; node memories; performance degradation; recovery data; shared virtual memory; single shared address space; stable storage abstraction; Computer architecture; Degradation; Fault detection; Fault tolerance; Hardware; Large-scale systems; Memory architecture; Memory management; Multiprocessor interconnection networks; Protocols;
Conference_Title :
Supercomputing '94., Proceedings
Conference_Location :
Washington, DC
Print_ISBN :
0-8186-6605-6
DOI :
10.1109/SUPERC.1994.344300