Title :
Checkpoint/rollback in a distributed system using coarse-grained dataflow
Author :
Cummings, D. ; Alkalaj, L.
Author_Institution :
Jet Propulsion Lab., California Inst. of Technol., Pasadena, CA, USA
Abstract :
The Common Spaceborne Multicomputer Operating System (COSMOS) is a spacecraft operating system for distributed memory multiprocessors, designed to meet the on-board computing requirements of long-life interplanetary missions. One of the main features of COSMOS is software-implemented fault-tolerance, including 2-way voting, 3-way voting, and checkpoint/rollback. This paper describes the COSMOS distributed checkpoint/rollback approach, which exploits the fact that a COSMOS application program is based on a coarse-grained dataflow programming paradigm and therefore most of the state of a distributed application program is contained in the data tokens. Furthermore, all computers maintain a consistent view of this dynamic state, which facilitates the implementation of a coordinated checkpoint.
Keywords :
aerospace computing; concurrency control; distributed memory systems; fault tolerant computing; operating systems (computers); parallel processing; software reliability; space vehicles; 2-way voting; 3-way voting; COSMOS; Common Spaceborne Multicomputer Operating System; checkpoint; coarse-grained dataflow; coarse-grained dataflow programming paradigm; coordinated checkpoint; data tokens; distributed application program; distributed memory multiprocessors; distributed system; long-life interplanetary missions; on-board computing requirements; rollback; software-implemented fault-tolerance; spacecraft operating system; Distributed computing; Fault tolerance; Operating systems; Orbital robotics; Propulsion; Real time systems; Robot kinematics; Space technology; Space vehicles; Voting;
Conference_Title :
Fault-Tolerant Computing, 1994. FTCS-24. Digest of Papers., Twenty-Fourth International Symposium on
Conference_Location :
Austin, TX, USA
Print_ISBN :
0-8186-5520-8
DOI :
10.1109/FTCS.1994.315619