DocumentCode
3245114
Title
Reliable Software Distributed Shared Memory Using Page Migration
Author
Lee, Jinpil ; Sato, Mitsuhisa
Author_Institution
Grad. Sch. of Syst. & Inf. Eng., Univ. of Tsukuba, Tsukuba, Japan
fYear
2009
fDate
8-11 Dec. 2009
Firstpage
456
Lastpage
463
Abstract
Reliability has recently become an important issue in PC cluster technology. This research proposes a software distributed shared memory system, named SCASH-FT, as an execution platform for high-performance, highly reliable parallel systems on commodity PC clusters. To achieve fault tolerance, each node holds redundant page data that allows SCASH-FT to recover from node failure. All page data is checkpointed and duplicated to another node when the user explicitly calls the checkpoint function. When a failure occurs, SCASH-FT invokes the rollback function, restarting execution from the last checkpoint data. SCASH-FT handles tasks such as detecting failures and restarting execution, so users only need to add checkpoint function calls to the source code to determine the timing of each checkpoint. Evaluation results show that the checkpoint cost and the rollback penalty depend on the data access pattern and the checkpoint frequency. Thus, users can control application performance by adjusting the checkpoint frequency.
Keywords
checkpointing; distributed memory systems; fault tolerant computing; software reliability; PC cluster technology; checkpoint data; checkpoint frequency; checkpoint function calls; data access pattern; execution platform; fault tolerance; node failure recovery; page data checkpointing; page migration; redundant page data; reliable parallel system; reliable software distributed shared memory; rollback function; rollback penalty; Application software; Cost function; Distributed computing; Fault tolerance; Frequency; Functional programming; High performance computing; Reliability engineering; Software performance; Software systems; fault-tolerant system; parallel computing; software distributed shared memory;
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on
Conference_Location
Shenzhen
ISSN
1521-9097
Print_ISBN
978-1-4244-5788-5
Type
conf
DOI
10.1109/ICPADS.2009.106
Filename
5395317