Author_Institution :
Key Lab. of Sci. & Technol. for Nat. Defence of Parallel & Distrib. Process., Nat. Univ. of Defence Tech., Changsha, China
Abstract :
With the wide adoption of multi-core processor architectures in high-performance computing, fault tolerance for shared-memory parallel programs has become a research hot spot. For years, checkpointing has been the dominant fault tolerance technique in this field, and it has recently been the subject of many research efforts. However, for programs that handle large amounts of data, checkpointing may induce massive I/O traffic, which adversely affects scalability. To address this problem, this paper proposes a redundancy-based fault tolerance approach for shared-memory parallel programs. Our scheme avoids saving and restoring computational state during program execution and hence does not involve I/O operations, giving it a clear scalability advantage over checkpointing. We describe the approach and its associated compiler tool in detail and present experimental evaluation results.