DocumentCode :
2745958
Title :
Fault propagation analysis based variable length checkpoint placement for fault-tolerant parallel and distributed systems
Author :
Shah, Viral ; Bhattacharya, Sourav
Author_Institution :
Dept. of Comput. Sci. & Eng., Arizona State Univ., Tempe, AZ, USA
fYear :
1997
fDate :
11-15 Aug 1997
Firstpage :
612
Lastpage :
615
Abstract :
The paper proposes optimal checkpoint placement strategies using failure propagation analysis in a distributed rollback recovery system. The authors' previously proposed idea of failure propagation analysis (FPA) based checkpoint placement strategy is enhanced by incorporating link failures, task grouping/allocation, and loop stabilization aspects. Owing to the empirical observation that a large number of faults occur around message communication instructions, the checkpoint placement strategy places more checkpoints around message send/receive regions of the code. Allocation of tasks (or threads) onto different processors can lead to varied communication patterns, which in turn can affect the FPA process and the checkpoint placement strategies. Thus, another key contribution of our research is to show the cyclic relationship between checkpointing and task allocation, as well as recursion in parallel or distributed programs. The proposed ideas and FPA approaches are illustrated using a typical parallel algorithm: the fast Fourier transform (FFT).
Keywords :
fast Fourier transforms; message passing; parallel programming; resource allocation; software fault tolerance; system recovery; FFT; FPA based checkpoint placement strategy; FPA process; checkpoint placement strategies; checkpoint placement strategy; cyclic relationship; distributed programs; distributed rollback recovery system; distributed systems; failure propagation analysis; fast Fourier transform; fault propagation analysis based variable length checkpoint placement; fault tolerant parallel systems; link failures; loop stabilization aspects; message communication instructions; message send/receive regions; optimal checkpoint placement strategies; parallel algorithm; recursion; task allocation; task grouping/allocation; Algorithm design and analysis; Checkpointing; Computer aided instruction; Computer science; Concurrent computing; Failure analysis; Fault tolerance; Fault tolerant systems; Parallel algorithms; Yarn;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Software and Applications Conference, 1997. COMPSAC '97. Proceedings., The Twenty-First Annual International
Conference_Location :
Washington, DC
ISSN :
0730-3157
Print_ISBN :
0-8186-8105-5
Type :
conf
DOI :
10.1109/CMPSAC.1997.625081
Filename :
625081