Opportunistic application-level fault detection through adaptive redundant multithreading

Author

Hukerikar, Saurabh ; Diniz, Pedro C. ; Lucas, Robert F. ; Teranishi, K.

Author_Institution

Inf. Sci. Inst., Univ. of Southern California, Marina del Rey, CA, USA

fYear

2014

fDate

21-25 July 2014

Firstpage

243

Lastpage

250

Abstract

As the scale and complexity of future High Performance Computing (HPC) systems continue to grow, the rising frequency of faults and errors and their impact on HPC applications will make it increasingly difficult to accomplish useful computation. Traditional means of fault detection and correction are either hardware based or use software-based redundancy. Redundancy-based approaches usually entail complete replication of the program state or the computation and therefore incur substantial overhead to application performance. Consequently, the wide-scale use of full redundancy in future exascale-class systems is not a viable solution for error detection and correction. In this paper we present an application-level fault detection approach that is based on adaptive redundant multithreading. Through a language-level directive, the programmer can define structured code blocks. When these blocks are executed by multiple threads and their outputs compared, we can detect errors in specific parts of the program state that will ultimately determine the correctness of the application outcome. The compiler outlines such code blocks, and a runtime system decides whether their execution by redundant threads should be enabled or disabled by continuously observing and learning about the fault tolerance state of the system. By providing flexible building blocks for application-specific fault detection, our approach makes possible more reasonable performance overheads than full redundancy. Our results show that the overheads to application performance are in the range of 4% to 70%, because the runtime system is continuously aware of the rate and source of system faults, rather than the usual overhead in excess of 100% that is incurred by complete replication.
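The mechanism described in the abstract (run a marked block on redundant threads, compare the outputs, and let a runtime switch the redundancy on or off) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class `AdaptiveRedundancy`, its `window` parameter, and the simple "disable after N clean runs, re-enable on any fault" policy are hypothetical and are not the directive, compiler outlining, or learning policy of the paper. Python threads are used only to show the structure; they do not model the performance behaviour reported above.

```python
from concurrent.futures import ThreadPoolExecutor


class AdaptiveRedundancy:
    """Hypothetical runtime policy: duplicate a block only while faults look likely."""

    def __init__(self, window=3):
        self.window = window      # clean redundant runs required before switching off
        self.clean_runs = 0       # consecutive redundant runs whose outputs agreed
        self.enabled = True       # start conservatively with redundancy on

    def report_fault(self):
        """Record a fault observation (a mismatch here, or an external signal)."""
        self.clean_runs = 0
        self.enabled = True

    def run(self, block, *args):
        """Run block once, or twice with output comparison when redundancy is on.

        Returns (result, mismatch_detected).
        """
        if not self.enabled:
            return block(*args), False
        with ThreadPoolExecutor(max_workers=2) as pool:
            first = pool.submit(block, *args)
            second = pool.submit(block, *args)
            a, b = first.result(), second.result()
        if a != b:
            # Outputs disagree: an error affected at least one of the executions.
            self.report_fault()
            return a, True
        self.clean_runs += 1
        if self.clean_runs >= self.window:
            self.enabled = False  # system looks healthy, stop paying for duplication
        return a, False


def dot(xs, ys):
    """Example structured block whose output determines application correctness."""
    return sum(x * y for x, y in zip(xs, ys))
```

In this sketch a caller would wrap each protected block as `runtime.run(dot, xs, ys)` and call `runtime.report_fault()` whenever some other source reports a fault, which turns redundant execution back on. Detection only; what to do after a mismatch (retry, vote with a third thread, roll back) is left open.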

Keywords

fault tolerant computing; multi-threading; parallel processing; HPC applications; adaptive redundant multithreading; error correction; error detection; fault correction; future exascale class systems; hardware based redundancy; high performance computing systems; language level directive; opportunistic application-level fault detection; redundancy based approach; software based redundancy; structured code blocks; Fault detection; Hardware; Instruction sets; Multithreading; Redundancy; Runtime;

fLanguage

English

Publisher

IEEE

Conference_Titel

High Performance Computing & Simulation (HPCS), 2014 International Conference on

Conference_Location

Bologna

Print_ISBN

978-1-4799-5312-7

Type

conf

DOI

10.1109/HPCSim.2014.6903692

Filename

6903692