DocumentCode :
2455801
Title :
HiAL-Ckpt: A hierarchical application-level checkpointing for CPU-GPU hybrid systems
Author :
Xu, Xinhai ; Lin, Yufei ; Tang, Tao ; Lin, Yisong
Author_Institution :
Nat. Lab. for Parallel & Distrib. Process., Nat. Univ. of Defense Technol., Changsha, China
fYear :
2010
fDate :
24-27 Aug. 2010
Firstpage :
1895
Lastpage :
1899
Abstract :
Owing to its computing capacity and high energy efficiency, the GPU (graphics processing unit) has become a focus of HPC (high performance computing) research, and CPU-GPU heterogeneous parallel systems have become a new development trend in supercomputers. However, the inherent unreliability of GPU hardware degrades supercomputer reliability. We study fault-tolerance (FT) techniques for CPU-GPU heterogeneous parallel systems and introduce a new checkpointing mechanism for them: hierarchical application-level checkpointing. The basic idea is to checkpoint at two independent levels, the CPU level and the GPU level, to tolerate CPU and GPU faults respectively. Based on this idea, we have designed and implemented a hierarchical application-level checkpointing tool, "HiAL-Ckpt". Using this tool, programmers insert two kinds of directives, CPU directives and GPU directives, into a program, and the compiler transforms each directive into CPU or GPU checkpointing code according to its kind. A case study of SWIM, a benchmark from the SPEC2000 suite, demonstrates the validity of the hierarchical application-level checkpointing technique. The experimental results show that the fault-tolerance time overhead of HiAL-Ckpt for SWIM is only 2.25% relative to the execution time of SWIM without any FT work.
Keywords :
benchmark testing; checkpointing; computer graphic equipment; coprocessors; fault tolerance; fault tolerant computing; multiprocessing systems; parallel processing; program compilers; CPU directive; CPU-GPU heterogeneous parallel system; GPU directive; HiAL-Ckpt; SWIM; checkpointing code; fault tolerance technique; graphics processing unit; hierarchical application level checkpointing; high performance computing; program compiler; spec2000 benchmark suite; supercomputer reliability; Checkpointing; Computational modeling; Fault tolerance; Fault tolerant systems; Graphics processing unit; Hardware; GPU; checkpointing; fault-tolerance; heterogeneous systems;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computer Science and Education (ICCSE), 2010 5th International Conference on
Conference_Location :
Hefei
Print_ISBN :
978-1-4244-6002-1
Type :
conf
DOI :
10.1109/ICCSE.2010.5593819
Filename :
5593819