DocumentCode :
2933342
Title :
A programming model for resilience in extreme scale computing
Author :
Hukerikar, Saurabh ; Diniz, Pedro C. ; Lucas, Robert F.
Author_Institution :
Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA
fYear :
2012
fDate :
25-28 June 2012
Firstpage :
1
Lastpage :
6
Abstract :
System resilience is an important challenge that needs to be addressed in the era of extreme scale computing. Exascale supercomputers will be architected using millions of processor cores and memory modules. As process technology scales, the reliability of such systems will be challenged by the inherent unreliability of individual components due to extremely small transistor geometries, variability in silicon manufacturing processes, device aging, etc. Therefore, errors and failures in extreme scale systems will increasingly be the norm rather than the exception. Not all detected errors warrant catastrophic system failure, but there are presently no mechanisms for the programmer to communicate their knowledge of algorithmic fault tolerance to the system. We present a programming model approach for system resilience that allows programmers to explicitly express their fault tolerance knowledge. We propose novel resilience-oriented programming model extensions and programming directives, and illustrate their effectiveness. An inference engine leverages this information and combines it with runtime-gathered context to increase the dependability of HPC systems.
Keywords :
catastrophe theory; elemental semiconductors; fault tolerance; inference mechanisms; mainframes; manufacturing processes; memory architecture; parallel machines; transistors; HPC systems; algorithmic fault tolerance; catastrophic system failure; device aging; exascale supercomputers; extreme scale computing; fault tolerance knowledge; inference engine; memory modules; processor cores; programming directives; programming model approach; programming model extensions; resilience programming model; silicon manufacturing processes; small transistor geometries; Computational modeling; Context; Engines; Error correction codes; Programming; Resilience; Runtime; Exascale; Fault Tolerance; High-Performance Computing; Resilience;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on
Conference_Location :
Boston, MA
Print_ISBN :
978-1-4673-2264-5
Electronic_ISBN :
978-1-4673-2265-2
Type :
conf
DOI :
10.1109/DSNW.2012.6264671
Filename :
6264671