DocumentCode :
580072
Title :
FALCON: a system for reliable checkpoint recovery in shared grid environments
Author :
Islam, Tanzima Zerin ; Bagchi, Saurabh ; Eigenmann, Rudi
Author_Institution :
Purdue Univ., West Lafayette, IN, USA
fYear :
2009
fDate :
14-20 Nov. 2009
Firstpage :
1
Lastpage :
12
Abstract :
In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as their performance degradation is tolerable. However, unpredictable evictions of guest jobs lead to fluctuating completion times. Checkpoint-recovery is an attractive mechanism for recovering from such "failures". Today's FGCS systems often use expensive, high-performance dedicated checkpoint servers. However, in geographically distributed clusters, this may incur high checkpoint transfer latencies. In this paper we present a system called FALCON that uses available disk resources of the FGCS machines as shared checkpoint repositories. However, an unavailable storage host may lead to loss of checkpoint data. Therefore, we model failures of storage hosts and develop a prediction algorithm for choosing reliable checkpoint repositories. We experiment with FALCON in the university-wide Condor testbed at Purdue and show improved and consistent performance for guest jobs in the presence of irregular resource availability.
Keywords :
checkpointing; fault tolerant computing; grid computing; Condor testbed; FALCON; FGCS machines; FGCS system; Purdue; checkpoint recovery reliability; checkpoint transfer latency; distributed clusters; fine-grained cycle sharing system; guest job unpredictable eviction; high-performance dedicated checkpoint servers; irregular resource availability; prediction algorithm; shared checkpoint repository; shared grid environments; Condor; checkpointing; cycle-sharing systems; failure model; reliability;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing Networking, Storage and Analysis, Proceedings of the Conference on
Conference_Location :
Portland, OR
Type :
conf
DOI :
10.1145/1654059.1654110
Filename :
6375520