DocumentCode
244518
Title
Opportunistic application-level fault detection through adaptive redundant multithreading
Author
Hukerikar, Saurabh ; Diniz, Pedro C. ; Lucas, Robert F. ; Teranishi, K.
Author_Institution
Inf. Sci. Inst., Univ. of Southern California, Marina del Rey, CA, USA
fYear
2014
fDate
21-25 July 2014
Firstpage
243
Lastpage
250
Abstract
As the scale and complexity of future High Performance Computing (HPC) systems continue to grow, the rising frequency of faults and errors and their impact on HPC applications will make it increasingly difficult to accomplish useful computation. Traditional means of fault detection and correction are either hardware based or use software based redundancy. Redundancy based approaches usually entail complete replication of the program state or the computation and therefore incur substantial overhead to application performance. Therefore, the wide-scale use of full redundancy in future exascale class systems is not a viable solution for error detection and correction. In this paper we present an application level fault detection approach that is based on adaptive redundant multithreading. Through a language level directive, the programmer can define structured code blocks. When these blocks are executed by multiple threads and their outputs compared, we can detect errors in specific parts of the program state that will ultimately determine the correctness of the application outcome. The compiler outlines such code blocks and a runtime system reasons whether their execution by redundant threads should be enabled or disabled by continuously observing and learning about the fault tolerance state of the system. By providing flexible building blocks for application specific fault detection, our approach makes possible more reasonable performance overheads than full redundancy. Our results show that the overheads to application performance are in the range of 4% to 70% because the runtime system is continuously aware of the rate and source of system faults, rather than the usual overhead in excess of 100% that is incurred by complete replication.
Keywords
fault tolerant computing; multi-threading; parallel processing; HPC applications; adaptive redundant multithreading; error correction; error detection; fault correction; future exascale class systems; hardware based redundancy; high performance computing systems; language level directive; opportunistic application-level fault detection; redundancy based approach; software based redundancy; structured code blocks; Fault detection; Hardware; Instruction sets; Multithreading; Redundancy; Runtime;
fLanguage
English
Publisher
IEEE
Conference_Titel
High Performance Computing & Simulation (HPCS), 2014 International Conference on
Conference_Location
Bologna
Print_ISBN
978-1-4799-5312-7
Type
conf
DOI
10.1109/HPCSim.2014.6903692
Filename
6903692