DocumentCode
88489
Title
Fault Tolerance on Large Scale Systems using Adaptive Process Replication
Author
George, Cijo ; Vadhiyar, Sathish
Author_Institution
NetApp Adv. Technol. Group, Bangalore, India
Volume
64
Issue
8
fYear
2015
fDate
Aug. 1 2015
Firstpage
2213
Lastpage
2225
Abstract
Exascale systems of the future are predicted to have a mean time between failures (MTBF) of less than one hour. At such low MTBFs, periodic checkpointing alone will result in low efficiency, because the high number of application failures causes a large amount of work to be lost to rollbacks. In such scenarios, proactive fault tolerance mechanisms that can help avoid a significant number of failures are necessary. In this work, we have developed a mechanism for proactive fault tolerance using partial replication of a set of application processes. Our fault tolerance framework periodically and adaptively changes the set of replicated processes based on failure predictions, in order to avoid failures. We have developed an MPI prototype implementation, PAREP-MPI, that allows changing the replica set. We have shown that our strategy of adaptive process replication significantly outperforms existing mechanisms, providing up to 20 percent improvement in application efficiency even for exascale systems.
Keywords
checkpointing; fault tolerant computing; message passing; MPI prototype implementation; MTBF; PAREP-MPI; adaptive process replication; application failures; exascale systems; fault tolerance framework; large scale systems; mean time between failures; partial replication; periodic checkpointing; proactive fault tolerance mechanisms; Fault tolerant systems; Large-scale systems; Libraries; Receivers; Redundancy; Fault tolerance; Process replication
fLanguage
English
Journal_Title
Computers, IEEE Transactions on
Publisher
ieee
ISSN
0018-9340
Type
jour
DOI
10.1109/TC.2014.2360536
Filename
6911991