• DocumentCode
    3429799
  • Title

    Classifying soft error vulnerabilities in extreme-Scale scientific applications using a binary instrumentation tool

  • Author

    Dong Li ; Vetter, Jeffrey S. ; Weikuan Yu

  • fYear
    2012
  • fDate
    10-16 Nov. 2012
  • Firstpage
    1
  • Lastpage
    11
  • Abstract
    Extreme-scale scientific applications are at significant risk of being hit by soft errors on supercomputers as the scale of these systems and their component density continue to increase. To better understand the specific soft error vulnerabilities in scientific applications, we have built an empirical fault injection and consequence analysis tool, BIFIT, that allows us to evaluate how soft errors impact applications. In particular, BIFIT is designed to inject faults at very specific targets: an arbitrarily chosen execution point and any specific data structure. We apply BIFIT to three mission-critical scientific applications and investigate the applications' vulnerability to soft errors by performing thousands of statistical tests. We then classify each application's individual data structures based on their sensitivity to these vulnerabilities, and generalize these classifications across applications. These classifications can subsequently be used to apply appropriate resiliency solutions to each data structure within an application. Our study reveals that these scientific applications have a wide range of sensitivities to both the time and the location of a soft error; yet we are able to identify intrinsic relationships between application vulnerabilities and specific types of data objects. In this regard, BIFIT enables new opportunities for future resiliency research.
  • Keywords
    fault diagnosis; parallel machines; scientific information systems; statistical testing; BIFIT; binary instrumentation tool; component density; consequence analysis tool; data structure; empirical fault injection; extreme-scale scientific application; mission-critical scientific application; soft error vulnerabilities; statistical test; supercomputer; Algorithm design and analysis; Data structures; Hardware; Instruments; Libraries; Object recognition; Resilience;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for
  • Conference_Location
    Salt Lake City, UT
  • ISSN
    2167-4329
  • Print_ISBN
    978-1-4673-0805-2
  • Type

    conf

  • DOI
    10.1109/SC.2012.29
  • Filename
    6468536