DocumentCode :
2887376
Title :
A paradigm shift is coming - continuous failure
Author :
Geist, Al
Author_Institution :
Oak Ridge National Laboratory, P.O. Box 2008, Tennessee, USA
fYear :
2012
fDate :
21-25 May 2012
Firstpage :
371
Lastpage :
371
Abstract :
Resilience is a measure of the ability of a computing system and its applications to continue working in the presence of system degradations and failures. This talk presents the factors that are driving an exponential increase in system fault rates. At this rate of increase, if the hardware and software are not fault tolerant at Exascale, then even relatively short-lived applications are unlikely to finish or, worse, may complete with incorrect results. New paradigms must be developed for handling faults within both the system software and user applications. Also presented are new approaches for integrating detection algorithms in both the hardware and software, and new techniques to help simulations adapt to faults.
Keywords :
Continuous Failure; Fault Tolerance; Resilient Systems;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Collaboration Technologies and Systems (CTS), 2012 International Conference on
Conference_Location :
Denver, CO, USA
Print_ISBN :
978-1-4673-1381-0
Type :
conf
DOI :
10.1109/CTS.2012.6261077
Filename :
6261077