• DocumentCode
    3013408
  • Title
    Multiprocessor architecture using an audit trail for fault tolerance
  • Author
    Sunada, D. ; Glasco, D. ; Flynn, M.
  • Author_Institution
    Lab. of Comput. Syst., Stanford Univ., CA, USA
  • fYear
    1999
  • fDate
    15-18 June 1999
  • Firstpage
    40
  • Lastpage
    47
  • Abstract
    In order to deploy a tightly-coupled multiprocessor (TCMP) in the commercial world, the TCMP must be fault tolerant. Researchers have designed various checkpointing algorithms to implement fault tolerance in a TCMP. To date, these algorithms fall into two principal classes, where processors can be checkpoint dependent on each other. We introduce a new apparatus and algorithm that represents a third class of checkpointing scheme. Our algorithm is distributed recoverable shared memory with logs (DRSM-L) and is the first of its kind for TCMPs. DRSM-L has the desirable property that a processor can establish a checkpoint or roll back to the last checkpoint in a manner that is independent of any other processor. In this paper, we describe DRSM-L and present results indicating its performance.
  • Keywords
    distributed shared memory systems; fault tolerant computing; audit trail; checkpointing scheme; distributed recoverable shared memory with logs; fault tolerance; fault tolerant; tightly-coupled multiprocessor; Algorithm design and analysis; Business; Checkpointing; Computer architecture; Fault tolerance; Fault tolerant systems; Hardware
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fault-Tolerant Computing, 1999. Digest of Papers. Twenty-Ninth Annual International Symposium on
  • Conference_Location
    Madison, WI, USA
  • ISSN
    0731-3071
  • Print_ISBN
    0-7695-0213-X
  • Type
    conf
  • DOI
    10.1109/FTCS.1999.781032
  • Filename
    781032