• DocumentCode
    1918587
  • Title

    Poster: Programming Model Extensions for Resilience in Extreme Scale Computing

  • Author

    Hukerikar, Saurabh ; Diniz, Pedro C. ; Lucas, Robert F.

  • fYear
    2012
  • fDate
    10-16 Nov. 2012
  • Firstpage
    1434
  • Lastpage
    1434
  • Abstract
    System resilience is a key challenge in building extreme-scale systems. Many HPC applications are inherently resilient, but application programmers lack mechanisms to convey their fault tolerance knowledge to the system. We present a cross-layer approach to resilience: we propose a set of programming model extensions and develop a runtime inference framework that reasons about the context and significance of faults, as they occur, relative to the application programmer's fault tolerance expectations. Using a set of accelerated fault injection experiments on real scientific and engineering codes, we demonstrate the validity of our approach. Our experiments show that a cross-layer approach, which explicitly engages the programmer in expressing fault tolerance knowledge that is then leveraged across the layers of system abstraction, can significantly improve the dependability of long-running HPC applications.
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion:
  • Conference_Location
    Salt Lake City, UT
  • Print_ISBN
    978-1-4673-6218-4
  • Type

    conf

  • DOI
    10.1109/SC.Companion.2012.239
  • Filename
    6496022