DocumentCode :
3021831
Title :
A Programming Language Approach to Fault Tolerance for Fork-Join Parallelism
Author :
Zengin, Mustafa ; Vafeiadis, Viktor
fYear :
2013
fDate :
1-3 July 2013
Firstpage :
105
Lastpage :
112
Abstract :
When running big parallel computations on thousands of processors, the probability that an individual processor will fail during the execution cannot be ignored. Computations should be replicated, or else failures should be detected at runtime and failed subcomputations reexecuted. We follow the latter approach and propose a high-level operational semantics that detects computation failures, and allows failed computations to be restarted from the point of failure. We implement this high-level semantics with a lower-level operational semantics that provides a more accurate account of processor failures, and prove in Coq the correspondence between the high- and low-level semantics.
Keywords :
checkpointing; fault tolerant computing; parallel processing; programming language semantics; Coq; computation failure detection; fault tolerance; fork-join parallelism; high-level operational semantics; lower-level operational semantics; parallel computations; processor failures; programming language; Computational modeling; Context; Program processors; Semantics; Standards
fLanguage :
English
Publisher :
IEEE
Conference_Title :
2013 International Symposium on Theoretical Aspects of Software Engineering (TASE)
Conference_Location :
Birmingham
Type :
conf
DOI :
10.1109/TASE.2013.22
Filename :
6597884