DocumentCode :
3245114
Title :
Reliable Software Distributed Shared Memory Using Page Migration
Author :
Lee, Jinpil ; Sato, Mitsuhisa
Author_Institution :
Grad. Sch. of Syst. & Inf. Eng., Univ. of Tsukuba, Tsukuba, Japan
fYear :
2009
fDate :
8-11 Dec. 2009
Firstpage :
456
Lastpage :
463
Abstract :
Reliability has recently become an important issue in PC cluster technology. This research proposes a software distributed shared memory system, named SCASH-FT, as an execution platform for high-performance, highly reliable parallel systems on commodity PC clusters. To achieve fault tolerance, each node holds redundant page data that allows SCASH-FT to recover from node failure. All page data is checkpointed and duplicated to another node when the user explicitly calls the checkpoint function. When a failure occurs, SCASH-FT invokes the rollback function and restarts execution from the last checkpoint data. SCASH-FT takes charge of tasks such as detecting failures and restarting execution, so users only need to add checkpoint function calls to the source code to determine the timing of each checkpoint. Evaluation results show that the checkpoint cost and the rollback penalty depend on the data access pattern and the checkpoint frequency. Thus, users can control their application performance by adjusting the checkpoint frequency.
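The checkpoint and rollback cycle described in the abstract can be illustrated with a minimal sketch. This is not the SCASH-FT implementation or its API: the `Node` and `Cluster` classes, the ring-style choice of which node holds the duplicate, and the way a surviving node adopts the failed node's pages are all assumptions made for illustration, and only a single node failure is handled.

```python
class Node:
    """Hypothetical cluster node (not a SCASH-FT type)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.pages = {}         # page_id -> bytes, live page data
        self.local_ckpt = {}    # this node's pages at the last checkpoint
        self.replica_ckpt = {}  # duplicate of another node's checkpoint
        self.alive = True


class Cluster:
    """Hypothetical cluster that checkpoints pages and rolls back on failure."""

    def __init__(self, n):
        self.nodes = [Node(i) for i in range(n)]

    def buddy(self, i):
        # Assumption: node i's checkpoint is duplicated on node i+1 (ring order).
        return self.nodes[(i + 1) % len(self.nodes)]

    def checkpoint(self):
        # Called explicitly by the user: snapshot every node's pages
        # and duplicate the snapshot on its buddy node.
        for node in self.nodes:
            snapshot = dict(node.pages)
            node.local_ckpt = snapshot
            self.buddy(node.node_id).replica_ckpt = dict(snapshot)

    def fail(self, i):
        # Simulate a node failure: everything stored on the node is lost.
        node = self.nodes[i]
        node.alive = False
        node.pages.clear()
        node.local_ckpt.clear()
        node.replica_ckpt.clear()

    def rollback(self):
        # Surviving nodes restore their own pages from the last checkpoint.
        for node in self.nodes:
            if node.alive:
                node.pages = dict(node.local_ckpt)
        # Pages of a failed node are recovered from the duplicate held by
        # its buddy, which adopts them (assumed recovery policy).
        for node in self.nodes:
            if not node.alive:
                holder = self.buddy(node.node_id)
                holder.pages.update(holder.replica_ckpt)


cluster = Cluster(3)
for i, node in enumerate(cluster.nodes):
    node.pages[f"p{i}"] = b"initial"

cluster.checkpoint()                   # user-chosen checkpoint timing
cluster.nodes[1].pages["p1"] = b"new"  # work done after the checkpoint
cluster.fail(0)                        # node 0 fails
cluster.rollback()                     # restart from the last checkpoint
```

After the rollback in this sketch, node 1 holds both its own page `p1` and the recovered page `p0` with their checkpointed contents, and the update made after the checkpoint is lost. That lost work is the rollback penalty, which is why the checkpoint frequency trades checkpoint cost against recovery cost.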
Keywords :
checkpointing; distributed memory systems; fault tolerant computing; software reliability; PC cluster technology; checkpoint data; checkpoint frequency; checkpoint function calls; data access pattern; execution platform; fault tolerance; node failure recovery; page data checkpointing; page migration; redundant page data; reliable parallel system; reliable software distributed shared memory; rollback function; rollback penalty; Application software; Cost function; Distributed computing; Fault tolerance; Frequency; Functional programming; High performance computing; Reliability engineering; Software performance; Software systems; fault-tolerant system; parallel computing; software distributed shared memory
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on
Conference_Location :
Shenzhen
ISSN :
1521-9097
Print_ISBN :
978-1-4244-5788-5
Type :
conf
DOI :
10.1109/ICPADS.2009.106
Filename :
5395317