DocumentCode :
1926051
Title :
Increasing the availability provided by RADIC with low overhead
Author :
Santos, Guna ; Fialho, Leonardo ; Rexachs, Dolores ; Luque, Emilio
Author_Institution :
Comput. Archit. & Oper. Syst. Dept., Univ. Autonoma de Barcelona, Barcelona, Spain
fYear :
2009
fDate :
Aug. 31 2009-Sept. 4 2009
Firstpage :
1
Lastpage :
8
Abstract :
For machines composed of a large number of processing units, fault probability tends to increase linearly with the number of units. This makes the use of a fault-tolerant solution a major issue. A fault-tolerant solution provides a certain level of availability, which is usually influenced by time overhead, performance degradation, resources, or cost. In rollback-recovery protocols, availability is usually increased by raising the checkpoint frequency or by making several replicas of checkpoints and/or logs. Such replication allows the solution to tolerate concurrent correlated faults, i.e., a fault in a computing node and in the stable storage. These faults are theoretically less probable; however, recent studies have shown that faults are temporally and spatially correlated, which increases the probability of concurrent faults. The major concern in replicating checkpoints and logs is the overhead of storing these replicas across several repositories, which may make replication impractical. In this paper we present how we increased the availability provided by RADIC without significantly increasing its overhead. Our approach consists of parallelizing the storage of these replicas using the pipeline technique. This technique allows us to make low-overhead copies of checkpoints and logs over N protectors. Furthermore, as a secondary benefit, the pipelining between observer and protector reduces the pessimistic message logging overhead by more than a factor of four in the best case.
Keywords :
checkpointing; fault tolerant computing; message passing; pipeline processing; protocols; RADIC; checkpoint frequency; concurrent correlated fault tolerance; fault probability; message passing; performance degradation; pipeline technique; rollback-recovery protocol; system log; time overhead; Availability; Computer architecture; Concurrent computing; Costs; Fault tolerance; Hardware; Pipeline processing; Protection; Protocols; Redundancy;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on
Conference_Location :
New Orleans, LA
ISSN :
1552-5244
Print_ISBN :
978-1-4244-5011-4
Electronic_ISBN :
1552-5244
Type :
conf
DOI :
10.1109/CLUSTR.2009.5289163
Filename :
5289163