Title :
Data storage optimization of application-level checkpointing on heterogeneous systems
Author :
Jia Jia ; Wei Song
Author_Institution :
National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha, China
Abstract :
The emergence of general-purpose GPUs (GPGPUs) has made heterogeneous computing practical and has reshaped both general-purpose GPU computing and parallel computing. Heterogeneous systems have been adopted by large-scale high-performance computers. Fault tolerance is necessary for large-scale scientific computing, yet in the few years since GPGPUs and heterogeneous systems appeared, no effective fault tolerance method has emerged for them. To address this, this paper applies a traditional fault tolerance technique, application-level checkpointing, to heterogeneous systems. Because the main way to reduce the overhead of application-level checkpointing is to reduce the checkpoint data size, we analyze the heterogeneous system and GPGPU programs, propose a method to optimize the data storage of application-level checkpointing, and validate the optimization experimentally.
Keywords :
Checkpointing; Fault tolerance; Fault tolerant systems; Graphics processing units; Hardware; Kernel; Optimization; application-level checkpointing; fault tolerance method; general purpose GPU; heterogeneous system
Conference_Title :
Conference Anthology, IEEE
Conference_Location :
China
DOI :
10.1109/ANTHOLOGY.2013.6784773