DocumentCode :
1763669
Title :
Reliability of Heterogeneous Distributed Computing Systems in the Presence of Correlated Failures
Author :
Pezoa, Jorge E. ; Hayat, Majeed M.
Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of New Mexico, Albuquerque, NM, USA
Volume :
25
Issue :
4
fYear :
2014
fDate :
April 2014
Firstpage :
1034
Lastpage :
1043
Abstract :
While the reliability of distributed-computing systems (DCSs) has been widely studied under the assumption that computing elements (CEs) fail independently, the impact of correlated failures of CEs on the reliability remains an open question. Here, the problem of modeling and assessing the impact of stochastic, correlated failures on the service reliability of applications running on DCSs is tackled. The service reliability is modeled using an integrated analytical and Monte Carlo (MC) approach. The analytical component of the model comprises a generalization of a previously developed model for reliability of non-Markovian DCSs to a setting where specific patterns of simultaneous failures in CEs are allowed. The analytical model is complemented by an MC-based procedure to draw correlated-failure patterns using the recently reported concept of probabilistic shared risk groups (PSRGs). The reliability model is further utilized to develop and optimize a novel class of dynamic task reallocation (DTR) policies that maximize the reliability of DCSs in the presence of correlated failures. Theoretical predictions, MC simulations, and results from an emulation testbed show that the reliability can be improved when DTR policies correctly account for correlated failures. The impact of correlated failures of CEs on the reliability and the key dependence of DTR policies on the type of correlated failures are also investigated.
Keywords :
Monte Carlo methods; distributed processing; fault tolerant computing; stochastic processes; system recovery; DCS reliability model; DTR policies; Monte Carlo approach; PSRG; computing elements; correlated failures; draw correlated-failure patterns; dynamic task reallocation; generalization; heterogeneous distributed computing systems; nonMarkovian DCS; probabilistic shared risk groups; service reliability; Analytical models; Computational modeling; Correlation; Reliability; Servers; Vectors; Distributed computing; load balancing; non-Markovian process; reliability; shared risk group; spatially correlated failures;
fLanguage :
English
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
Publisher :
IEEE
ISSN :
1045-9219
Type :
jour
DOI :
10.1109/TPDS.2013.78
Filename :
6482556