Title :
A paradigm shift is coming - continuous failure
Author_Institution :
Oak Ridge National Laboratory, P.O. Box 2008, Tennessee, USA
Abstract :
Resilience is a measure of the ability of a computing system and its applications to continue working in the presence of system degradations and failures. This talk presents the factors that are driving an exponential increase in system fault rates. At this rate of increase, if the hardware and software are not fault tolerant at Exascale, then even relatively short-lived applications are unlikely to finish; or worse, the applications may complete with incorrect results. New paradigms must be developed for handling faults within both the system software and user applications. Also presented are new approaches for integrating detection algorithms in both the hardware and software, and new techniques to help simulations adapt to faults.
Keywords :
Continuous Failure; Fault Tolerance; Resilient Systems;
Conference_Title :
Collaboration Technologies and Systems (CTS), 2012 International Conference on
Conference_Location :
Denver, CO, USA
Print_ISBN :
978-1-4673-1381-0
DOI :
10.1109/CTS.2012.6261077