DocumentCode
1455223
Title
Computing in the RAIN: a reliable array of independent nodes
Author
Bohossian, Vasken ; Fan, Chenggong C. ; LeMahieu, Paul S. ; Riedel, Marc D. ; Xu, Lihao ; Bruck, Jehoshua
Author_Institution
Rainfinity, Pasadena, CA, USA
Volume
12
Issue
2
fYear
2001
fDate
2/1/2001
Firstpage
99
Lastpage
114
Abstract
The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly available video server, a highly available Web server, and a distributed checkpointing system. We also describe a commercial product, Rainwall, built with the RAIN technology.
Keywords
distributed processing; fault tolerant computing; protocols; workstation clusters; Internet data centers; RAIN project; Rainwall; communication protocols; data storage schemes; data-storage systems; distributed checkpointing system; distributed computing; error-control codes; fault management; fault-tolerant interconnect topologies; fault-tolerant topologies; group membership; heterogeneous cluster of computing; highly-available Web server; highly-available video server; independent nodes; link failures; network protocols; operating system services; proof-of-concept applications; reliable array; research collaboration; software-implemented fault tolerance; spaceborne missions; switch failures; Collaboration; Computer interfaces; Computer networks; Distributed computing; Fault tolerance; Network topology; Operating systems; Protocols; Rain; Switches;
fLanguage
English
Journal_Title
Parallel and Distributed Systems, IEEE Transactions on
Publisher
IEEE
ISSN
1045-9219
Type
jour
DOI
10.1109/71.910866
Filename
910866