DocumentCode :
1805863
Title :
Enhancing reliability and response times via replication in computing clusters
Author :
Qiu, Zhan ; Perez, Juan F.
Author_Institution :
Dept. of Comput., Imperial Coll. London, London, UK
fYear :
2015
fDate :
April 26 2015-May 1 2015
Firstpage :
1355
Lastpage :
1363
Abstract :
Computing clusters have been widely deployed for scientific and engineering applications to support intensive computation and massive data operations. As applications and resources in a cluster are subject to failures, fault-tolerance strategies are commonly adopted, sometimes at the expense of additional delays in job response times or an unnecessary increase in resource usage. In this paper, we explore concurrent replication with canceling, a fault-tolerance approach where jobs and their replicas are processed concurrently, and the successful completion of either one triggers the removal of the other. We propose a stochastic model to study how this approach affects the cluster service level objectives (SLOs), particularly the offered response time percentiles. In addition to the expected gains in reliability, the proposed model allows us to determine the utilization regions where introducing replication with canceling effectively reduces response times. Moreover, we show how this model can support resource provisioning decisions with reliability and response time guarantees.
Keywords :
concurrency (computers); fault tolerant computing; software reliability; system recovery; SLO; computing clusters; concurrent replication; failures; fault-tolerance strategies; intensive computation; job response times; massive data operations; reliability; service level objectives; Computational modeling; Computers; Conferences; Reliability; Servers; Switches; Time factors;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Communications (INFOCOM), 2015 IEEE Conference on
Conference_Location :
Kowloon
Type :
conf
DOI :
10.1109/INFOCOM.2015.7218512
Filename :
7218512