DocumentCode
1918587
Title
Poster: Programming Model Extensions for Resilience in Extreme Scale Computing
Author
Hukerikar, Saurabh ; Diniz, Pedro C. ; Lucas, Robert F.
fYear
2012
fDate
10-16 Nov. 2012
Firstpage
1434
Lastpage
1434
Abstract
System resilience is a key challenge to building extreme scale systems. A large number of HPC applications are inherently resilient, but application programmers lack mechanisms to convey their fault tolerance knowledge to the system. We present a cross-layer approach to resilience in which we propose a set of programming model extensions and develop a runtime inference framework that can reason about the context and significance of faults, as they occur, to the application programmer´s fault tolerance expectations. We demonstrate using a set accelerated fault injection experiments the validity of our approach with a set of real scientific and engineering codes. Our experiments show that a cross-layer approach that explicitly engages the programmer in expressing fault tolerance knowledge which is then leveraged across the layers of system abstraction can significantly improve the dependability of long running HPC applications.
fLanguage
English
Publisher
ieee
Conference_Titel
High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion:
Conference_Location
Salt Lake City, UT
Print_ISBN
978-1-4673-6218-4
Type
conf
DOI
10.1109/SC.Companion.2012.239
Filename
6496022
Link To Document