Title :
Application-Driven Coordination-Free Distributed Checkpointing
Author :
Agbaria, Adnan ; Sanders, William H.
Author_Institution :
Coordinated Sci. Lab., Univ. of Illinois at Urbana-Champaign, Urbana, IL
Abstract :
Distributed checkpointing is an important concept in providing fault tolerance in distributed systems. In today's applications, e.g., grid and massively parallel applications, the overhead of taking a distributed checkpoint with the known approaches can often outweigh its benefits because of the coordination and other overhead incurred by the processes. This paper presents an innovative approach to distributed checkpointing in which the checkpoints are obtained by offline analysis at the application level, so no coordination is required during execution. After presenting the approach, the authors prove its safety and present a performance analysis of it using stochastic models.
Keywords :
checkpointing; distributed algorithms; software fault tolerance; stochastic processes; application driven coordination; distributed systems; fault tolerance; free distributed checkpointing; stochastic models; Checkpointing; Distributed computing; Fault tolerant systems; Force control; Grid computing; Internet; Performance analysis; Protocols; Safety; Stochastic processes
Conference_Title :
Proceedings of the 25th IEEE International Conference on Distributed Computing Systems (ICDCS 2005)
Conference_Location :
Columbus, OH
Print_ISBN :
0-7695-2331-5
DOI :
10.1109/ICDCS.2005.14