DocumentCode :
2177537
Title :
On the Combination of Silent Error Detection and Checkpointing
Author :
Aupy, Guillaume ; Benoit, A. ; Herault, Thomas ; Robert, Yves ; Vivien, F. ; Zaidouni, Dounia
Author_Institution :
Ecole Normale Supérieure de Lyon, Lyon, France
fYear :
2013
fDate :
2-4 Dec. 2013
Firstpage :
11
Lastpage :
20
Abstract :
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Unlike fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after a delay that follows a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time during which nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
Keywords :
checkpointing; error detection; exponential distribution; parallel processing; high performance computing; probability distribution; rollback recovery strategies; silent data corruption error detection; traditional checkpointing; Approximation methods; Computational modeling; Equations; Mathematical model; Xenon; error recovery; silent data corruption; verification
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Dependable Computing (PRDC), 2013 IEEE 19th Pacific Rim International Symposium on
Conference_Location :
Vancouver, BC
Type :
conf
DOI :
10.1109/PRDC.2013.10
Filename :
6820836