• DocumentCode
    3245114
  • Title
    Reliable Software Distributed Shared Memory Using Page Migration
  • Author
    Lee, Jinpil ; Sato, Mitsuhisa
  • Author_Institution
    Grad. Sch. of Syst. & Inf. Eng., Univ. of Tsukuba, Tsukuba, Japan
  • fYear
    2009
  • fDate
    8-11 Dec. 2009
  • Firstpage
    456
  • Lastpage
    463
  • Abstract
    Reliability has recently become an important issue in PC cluster technology. This research proposes a software distributed shared memory system, named SCASH-FT, as an execution platform for high-performance, highly reliable parallel systems on commodity PC clusters. To achieve fault tolerance, each node holds redundant page data, which allows SCASH-FT to recover from node failure. All page data is checkpointed and duplicated to another node when a user explicitly calls the checkpoint function. When a failure occurs, SCASH-FT invokes the rollback function, restarting execution from the last checkpoint data. SCASH-FT handles tasks such as detecting failures and restarting execution, so users only need to add checkpoint function calls to the source code to determine the timing of each checkpoint. Evaluation results show that the checkpoint cost and the rollback penalty depend on the data access pattern and the checkpoint frequency. Thus, users can control their application performance by adjusting the checkpoint frequency.
  • Keywords
    checkpointing; distributed memory systems; fault tolerant computing; software reliability; PC cluster technology; checkpoint data; checkpoint frequency; checkpoint function calls; data access pattern; execution platform; fault tolerance; node failure recovery; page data checkpointing; page migration; redundant page data; reliable parallel system; reliable software distributed shared memory; rollback function; rollback penalty; Application software; Cost function; Distributed computing; Fault tolerance; Frequency; Functional programming; High performance computing; Reliability engineering; Software performance; Software systems; fault-tolerant system; parallel computing; software distributed shared memory;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on
  • Conference_Location
    Shenzhen
  • ISSN
    1521-9097
  • Print_ISBN
    978-1-4244-5788-5
  • Type
    conf
  • DOI
    10.1109/ICPADS.2009.106
  • Filename
    5395317
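
The checkpoint and rollback behaviour summarized in the abstract can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the SCASH-FT implementation or its API: the names `Node`, `ToyCluster`, `checkpoint`, `fail`, and `rollback` are hypothetical, pages are plain dictionary entries, the "other node" holding the duplicate is assumed to be the next node in a ring, and only a single node failure between checkpoints is handled.

```python
import copy


class Node:
    """Hypothetical node: live pages, a local checkpoint, and a peer's duplicate."""

    def __init__(self, node_id, num_pages):
        self.node_id = node_id
        self.pages = {p: 0 for p in range(num_pages)}  # live page data
        self.own_checkpoint = {}   # local copy taken at the last checkpoint
        self.peer_backup = {}      # duplicate of the previous node's checkpoint
        self.alive = True


class ToyCluster:
    """Toy model of explicit checkpointing with duplication to another node."""

    def __init__(self, num_nodes, pages_per_node):
        # Assumption: at least two nodes, so the duplicate lives elsewhere.
        self.nodes = [Node(i, pages_per_node) for i in range(num_nodes)]

    def _buddy(self, node_id):
        # Assumption: each node's duplicate is stored on the next node in a ring.
        return self.nodes[(node_id + 1) % len(self.nodes)]

    def checkpoint(self):
        """Explicit user call: snapshot every node's pages and duplicate them."""
        for node in self.nodes:
            snapshot = copy.deepcopy(node.pages)
            node.own_checkpoint = snapshot
            self._buddy(node.node_id).peer_backup = copy.deepcopy(snapshot)

    def fail(self, node_id):
        """Simulate a node failure: everything stored on that node is lost."""
        node = self.nodes[node_id]
        node.alive = False
        node.pages = {}
        node.own_checkpoint = {}
        node.peer_backup = {}

    def rollback(self):
        """Restore all nodes to the last checkpoint (single failure assumed)."""
        for node in self.nodes:
            if node.alive:
                node.pages = copy.deepcopy(node.own_checkpoint)
            else:
                # Recover the failed node's pages from the duplicate on its buddy.
                node.pages = copy.deepcopy(self._buddy(node.node_id).peer_backup)
                node.alive = True
        # Re-establish the duplicates that were lost with the failed node.
        self.checkpoint()


cluster = ToyCluster(num_nodes=3, pages_per_node=2)
cluster.nodes[0].pages[0] = 10
cluster.checkpoint()            # user decides when to checkpoint
cluster.nodes[0].pages[0] = 99  # work done after the checkpoint
cluster.fail(1)                 # node 1 is lost
cluster.rollback()              # every node returns to the checkpointed state
```

In this toy model, work done after the last checkpoint is discarded on rollback, which mirrors the trade-off the abstract describes: checkpointing more often costs more up front but loses less work when a failure occurs.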