DocumentCode :
1095816
Title :
Lessons from FTM: an experiment in design and implementation of a low-cost fault tolerant system
Author :
Muller, Gilles ; Banatre, Michel ; Peyrouze, Nadine ; Rochat, Bruno
Author_Institution :
IRISA/INRIA, Rennes, France
Volume :
45
Issue :
2
fYear :
1996
fDate :
6/1/1996
Firstpage :
332
Lastpage :
340
Abstract :
This paper describes an experiment in the design of a general-purpose fault-tolerant system, FTM. The main objective of the FTM design was to implement a low-cost fault-tolerant system that could run on standard workstations. At the operating system level, the authors' goal was to offer fault-tolerance transparency to user applications: porting an application to FTM should require only compiling its source code, without modifying it. These objectives were achieved using the Mach micro-kernel and a modular set of reliable servers that implement application checkpoints and keep system functions running despite machine crashes. At the architectural level, their approach relies on a high-performance stable storage implementation, called stable transactional memory (STM), which can be implemented either in hardware or in software. The authors first motivate their design choices, then detail the FTM implementation at both the architectural and operating system levels. They discuss why their stable memory technology evolved from hardware to software, evaluate the performance of the FTM prototype, and conclude with lessons learned and some assessments.
Keywords :
fault tolerant computing; microcomputers; operating system kernels; reliability; transaction processing; workstations; Mach micro-kernel; fault tolerant microprocessor; general purpose fault tolerant system; operating system; reliability; servers; source code compiling; stable transactional memory; standard workstations; Application software; Computer architecture; Computer crashes; Continuous time systems; Costs; Fault tolerance; Fault tolerant systems; Hardware; Local area networks; Operating systems;
fLanguage :
English
Journal_Title :
IEEE Transactions on Reliability
Publisher :
IEEE
ISSN :
0018-9529
Type :
jour
DOI :
10.1109/24.510822
Filename :
510822