DocumentCode
3429799
Title
Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool
Author
Dong Li ; Vetter, Jeffrey S. ; Weikuan Yu
fYear
2012
fDate
10-16 Nov. 2012
Firstpage
1
Lastpage
11
Abstract
Extreme-scale scientific applications are at significant risk of being hit by soft errors on supercomputers as the scale of these systems and the component density continue to increase. In order to better understand the specific soft error vulnerabilities in scientific applications, we have built an empirical fault injection and consequence analysis tool, BIFIT, that allows us to evaluate how soft errors impact applications. In particular, BIFIT is designed with the capability to inject faults at very specific targets: an arbitrarily chosen execution point and any specific data structure. We apply BIFIT to three mission-critical scientific applications and investigate the applications' vulnerability to soft errors by performing thousands of statistical tests. We then classify each application's individual data structures based on their sensitivity to these vulnerabilities, and generalize these classifications across applications. Subsequently, these classifications can be used to apply appropriate resiliency solutions to each data structure within an application. Our study reveals that these scientific applications have a wide range of sensitivities to both the time and the location of a soft error; yet, we are able to identify intrinsic relationships between application vulnerabilities and specific types of data objects. In this regard, BIFIT enables new opportunities for future resiliency research.
Keywords
fault diagnosis; parallel machines; scientific information systems; statistical testing; BIFIT; binary instrumentation tool; component density; consequence analysis tool; data structure; empirical fault injection; extreme-scale scientific application; mission-critical scientific application; soft error vulnerabilities; statistical test; supercomputer; Algorithm design and analysis; Data structures; Hardware; Instruments; Libraries; Object recognition; Resilience;
fLanguage
English
Publisher
IEEE
Conference_Titel
High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for
Conference_Location
Salt Lake City, UT
ISSN
2167-4329
Print_ISBN
978-1-4673-0805-2
Type
conf
DOI
10.1109/SC.2012.29
Filename
6468536
Link To Document