DocumentCode
1835255
Title
A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing
Author
Chen, Zizhong ; Dongarra, Jack
Author_Institution
Dept. of Math. & Comput. Sci., Colorado Sch. of Mines, Golden, CO
fYear
2008
fDate
3-5 Dec. 2008
Firstpage
71
Lastpage
79
Abstract
Diskless checkpointing is an efficient technique to save the state of a long-running application in a distributed environment without relying on stable storage. In this paper, we introduce several scalable encoding strategies into diskless checkpointing and reduce the overhead to survive k failures in p processes from 2⌈log p⌉·k((β + 2γ)m + α) to (1 + O(1/√m))·k(β + 2γ)m, where α is the communication latency, 1/β is the network bandwidth between processes, 1/γ is the rate to perform calculations, and m is the size of the local checkpoint per process. The introduced algorithm is scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. We evaluate the performance overhead of the introduced algorithm by using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that the introduced techniques are highly scalable.
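The two overhead expressions in the abstract can be compared numerically. The sketch below is illustrative only and is not taken from the paper: it assumes the bracketed logarithm is a base-2 ceiling, replaces the hidden constant in the O(1/sqrt(m)) term with a hypothetical parameter c, and uses hypothetical function names. Any parameter values passed in are the caller's own assumptions.

```python
import math


def tree_encoding_overhead(p, k, m, alpha, beta, gamma):
    """Baseline bound from the abstract: 2*ceil(log p) * k * ((beta + 2*gamma)*m + alpha).

    Assumption: the logarithm is base 2 and the brackets denote a ceiling.
    """
    return 2 * math.ceil(math.log2(p)) * k * ((beta + 2 * gamma) * m + alpha)


def scalable_encoding_overhead(k, m, beta, gamma, c=1.0):
    """Improved bound from the abstract: (1 + O(1/sqrt(m))) * k * (beta + 2*gamma) * m.

    Assumption: c stands in for the unknown constant hidden by the O() term.
    Note that p does not appear: the bound does not grow with the process count.
    """
    return (1 + c / math.sqrt(m)) * k * (beta + 2 * gamma) * m
```

Evaluating tree_encoding_overhead for increasing p while holding the other parameters fixed shows the logarithmic growth of the baseline, whereas scalable_encoding_overhead takes no p argument at all, which mirrors the scalability claim in the abstract.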
Keywords
checkpointing; conjugate gradient methods; encoding; diskless checkpointing; network bandwidth; preconditioned conjugate gradient equation; scalable checkpoint encoding algorithm; Bandwidth; Contracts; Delay; Fault tolerance; High performance computing; Scalability; Systems engineering and theory; USA Councils; Checkpoint; Reed-Solomon encoding; parallel and distributed systems
fLanguage
English
Publisher
IEEE
Conference_Titel
High Assurance Systems Engineering Symposium, 2008. HASE 2008. 11th IEEE
Conference_Location
Nanjing
ISSN
1530-2059
Print_ISBN
978-0-7695-3482-4
Type
conf
DOI
10.1109/HASE.2008.13
Filename
4708865