DocumentCode :
2960790
Title :
Two-level checkpoint/restart modeling for GPGPU
Author :
Laosooksathit, Supada ; Naksinehaboon, Nichamon ; Leangsuksun, Chokchai
Author_Institution :
Dept. of Comput. Sci., Louisiana Tech Univ., Ruston, LA, USA
fYear :
2011
fDate :
27-30 Dec. 2011
Firstpage :
276
Lastpage :
283
Abstract :
Because the reliability and availability of a large-scale system are inversely related to the number of computing elements, fault tolerance has become a major concern in high performance computing (HPC), including very large systems with GPGPUs. In this paper, we propose a checkpoint/restart mechanism model that employs a two-phase protocol and a latency hiding technique, such as CUDA streams, to achieve low checkpoint overhead. We introduce GPU checkpoint and restart protocols. We also present experimental results and analyze the influence of the mechanism, especially on long-running applications.
Keywords :
checkpointing; fault tolerant computing; graphics processing units; CUDA streams; GPGPU; fault tolerance; high performance computing; large scaled system; latency hiding technique; restart protocols; two-level checkpoint mechanism modeling; two-level restart mechanism modeling; two-phase protocol; Arrays; Checkpointing; Fault tolerance; Fault tolerant systems; Graphics processing unit; Kernel; Protocols;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Systems and Applications (AICCSA), 2011 9th IEEE/ACS International Conference on
Conference_Location :
Sharm El-Sheikh
ISSN :
2161-5322
Print_ISBN :
978-1-4577-0475-8
Electronic_ISBN :
2161-5322
Type :
conf
DOI :
10.1109/AICCSA.2011.6126619
Filename :
6126619