Title :
Exploit failure prediction for adaptive fault-tolerance in cluster computing
Author :
Li, Yawei ; Lan, Zhiling
Author_Institution :
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL, USA
Abstract :
As cluster computing scales up, it is becoming difficult for long-running applications to complete on large-scale clusters without encountering failures. Checkpointing/restart is widely used to provide basic fault tolerance, yet it incurs high overhead and is purely reactive. In this work, we propose FT-Pro, an adaptive fault management mechanism that uses failure prediction to optimally choose among migration, checkpointing, or no action so as to reduce application execution time in the presence of failures. A cost-based evaluation model is presented for making this decision dynamically at run time. Using an actual failure log from a production cluster at NCSA, we demonstrate that even with modest failure prediction accuracy, FT-Pro reduces application execution time by 13%-30% relative to the traditional checkpointing/restart strategy, a significant improvement for long-running applications.
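The abstract does not give the cost model itself, so the following is only a minimal sketch of what a cost-based choice among the three actions could look like. Everything in it is an assumption made for illustration and not FT-Pro's actual formulation: the names `Costs`, `expected_overheads`, and `choose_action`, the simplified expected-overhead formulas, and the premise that a migration fully avoids the predicted failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Hypothetical per-action costs, in seconds."""
    checkpoint: float  # time to write a checkpoint
    migration: float   # time to move processes off the suspect node
    restart: float     # time to restart from the last checkpoint


def expected_overheads(p_fail: float, unsaved_work: float, costs: Costs) -> dict:
    """Expected overhead of each action for the upcoming interval.

    Assumed toy model (not taken from the paper):
      none       -> on failure, pay the restart and redo the unsaved work
      checkpoint -> pay the checkpoint now; on failure, pay only the restart
      migrate    -> pay the migration; the predicted failure is assumed avoided
    """
    return {
        "none": p_fail * (costs.restart + unsaved_work),
        "checkpoint": costs.checkpoint + p_fail * costs.restart,
        "migrate": costs.migration,
    }


def choose_action(p_fail: float, unsaved_work: float, costs: Costs) -> str:
    """Return the action with the smallest expected overhead."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be a probability in [0, 1]")
    overheads = expected_overheads(p_fail, unsaved_work, costs)
    # On ties, min() keeps the first key in insertion order,
    # so the less intrusive action wins.
    return min(overheads, key=overheads.get)


if __name__ == "__main__":
    costs = Costs(checkpoint=60.0, migration=120.0, restart=30.0)
    for p in (0.0, 0.01, 0.5, 0.9):
        print(p, choose_action(p, unsaved_work=3600.0, costs=costs))
```

Under these assumed formulas, a low predicted failure probability favors taking no action, while a high probability favors whichever of checkpointing or migration is cheaper for the given costs. A real model would also need to account for prediction accuracy, which the abstract names as a key factor.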
Keywords :
checkpointing; fault tolerant computing; workstation clusters; FT-Pro; adaptive fault management mechanism; adaptive fault-tolerance; checkpointing/restart; cluster computing; failure prediction; fault-tolerant functionality; large-scale clusters; Accuracy; Application software; Checkpointing; Circuit faults; Computer science; Costs; Fault tolerance; Hardware; Large-scale systems; Performance loss
Conference_Title :
Cluster Computing and the Grid, 2006. CCGRID 06. Sixth IEEE International Symposium on
Conference_Location :
Singapore
Print_ISBN :
0-7695-2585-7
DOI :
10.1109/CCGRID.2006.45