• DocumentCode
    244518
  • Title

    Opportunistic application-level fault detection through adaptive redundant multithreading

  • Author

    Hukerikar, Saurabh ; Diniz, Pedro C. ; Lucas, Robert F. ; Teranishi, K.

  • Author_Institution
    Inf. Sci. Inst., Univ. of Southern California, Marina del Rey, CA, USA
  • fYear
    2014
  • fDate
    21-25 July 2014
  • Firstpage
    243
  • Lastpage
    250
  • Abstract
    As the scale and complexity of future High Performance Computing systems continue to grow, the rising frequency of faults and errors and their impact on HPC applications will make it increasingly difficult to accomplish useful computation. Traditional means of fault detection and correction are either hardware based or use software based redundancy. Redundancy based approaches usually entail complete replication of the program state or the computation and therefore incur substantial overhead to application performance. As a result, the wide-scale use of full redundancy in future exascale class systems is not a viable solution for error detection and correction. In this paper we present an application level fault detection approach that is based on adaptive redundant multithreading. Through a language level directive, the programmer can define structured code blocks. When these blocks are executed by multiple threads and their outputs compared, we can detect errors in specific parts of the program state that will ultimately determine the correctness of the application outcome. The compiler outlines such code blocks, and a runtime system reasons whether their execution by redundant threads should be enabled or disabled by continuously observing and learning about the fault tolerance state of the system. By providing flexible building blocks for application specific fault detection, our approach makes possible more reasonable performance overheads than full redundancy does. Our results show that, because the runtime system is continuously aware of the rate and source of system faults, the overheads to application performance are in the range of 4% to 70%, rather than the usual overhead in excess of 100% that is incurred by complete replication.
  • Keywords
    fault tolerant computing; multi-threading; parallel processing; HPC applications; adaptive redundant multithreading; error correction; error detection; fault correction; future exascale class systems; hardware based redundancy; high performance computing systems; language level directive; opportunistic application-level fault detection; redundancy based approach; software based redundancy; structured code blocks; Fault detection; Hardware; Instruction sets; Multithreading; Redundancy; Runtime
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing & Simulation (HPCS), 2014 International Conference on
  • Conference_Location
    Bologna
  • Print_ISBN
    978-1-4799-5312-7
  • Type

    conf

  • DOI
    10.1109/HPCSim.2014.6903692
  • Filename
    6903692