Title :
Lightweight blocking coordinated checkpointing for cluster computer systems
Author :
Lotfi, Mehdi ; Motamedi, Seyed Ahmad ; Bandarabadi, Mojtaba
Author_Institution :
Electr. Eng. Dept., Amirkabir Univ. of Technol., Tehran
Abstract :
In this paper we introduce a new approach to blocking coordinated checkpointing that uses two-level checkpointing for high-performance cluster computing systems. The first level is local checkpointing: computing nodes save checkpoints to local disk based on the transient failure rate. If a transient failure occurs in a computing node, the process can recover from the local disk. The second level is global checkpointing: computing nodes send their checkpoints to highly reliable global stable storage on the network based on the permanent failure rate. If a permanent failure occurs in a computing node, that node can no longer be used, and the process recovers from global storage on a new computing node. Transient failures are more probable than permanent failures, so global checkpoints are taken far less often than local checkpoints. With this method, coordinated checkpointing overhead is reduced and is proportional to the transient and permanent failure rates of the cluster system.
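The two-level scheme described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' algorithm: the abstract does not say how checkpoint intervals are derived from failure rates, so Young's first-order approximation is swapped in as a stand-in, and the names `Level`, `recovery_source`, and `overhead_fraction`, along with all numeric values, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Level:
    """One checkpointing level (hypothetical model, not from the paper)."""
    name: str
    failure_rate: float     # failures per second, assumed known
    checkpoint_cost: float  # seconds to write one checkpoint, assumed

    def interval(self) -> float:
        # Assumption: Young's first-order approximation sqrt(2 * C / lambda).
        # The abstract only says the interval depends on the failure rate.
        return math.sqrt(2.0 * self.checkpoint_cost / self.failure_rate)


def recovery_source(failure_kind: str) -> str:
    """Pick where a process restarts from, following the abstract's two cases."""
    if failure_kind == "transient":
        return "local disk, same node"
    if failure_kind == "permanent":
        return "global stable storage, new node"
    raise ValueError(f"unknown failure kind: {failure_kind}")


def overhead_fraction(levels: list[Level]) -> float:
    """Fraction of run time spent writing checkpoints, summed over levels."""
    return sum(level.checkpoint_cost / level.interval() for level in levels)


# Illustrative values only: transient failures are assumed 100x more frequent.
local_level = Level("local", failure_rate=1e-4, checkpoint_cost=2.0)
global_level = Level("global", failure_rate=1e-6, checkpoint_cost=50.0)
```

With these made-up rates, the global level checkpoints far less often than the local level, which matches the abstract's claim that overhead tracks the two failure rates.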
Keywords :
checkpointing; computer network reliability; workstation clusters; cluster computer system; computing node; global stable network storage reliability; lightweight blocking coordinated checkpointing; local disk checkpoint; permanent failure rate; transient failure rate; Checkpointing; Clustering algorithms; Communication channels; Computer networks; Degradation; Fault tolerant systems; Frequency synchronization; High performance computing; Space technology; System performance; blocking coordinated checkpointing; permanent failure; transient failure; two-level checkpointing;
Conference_Title :
System Theory, 2009. SSST 2009. 41st Southeastern Symposium on
Conference_Location :
Tullahoma, TN
Print_ISBN :
978-1-4244-3324-7
Electronic_ISBN :
0094-2898
DOI :
10.1109/SSST.2009.4806838