Title :
What is Missing in Current Checkpoint Interval Models?
Author :
Fialho, Leonardo ; Rexachs, Dolores ; Luque, Emilio
Author_Institution :
Dept. of Comput. Archit. & Oper. Syst., Univ. Autonoma of Barcelona, Barcelona, Spain
Abstract :
The growth in the number of components that make up parallel computers increases their fault frequency. In such systems, faults are no longer rare events but a common problem, so some form of fault tolerance should be provided. In general, fault tolerance protocols rely on checkpoints. A common question surrounding checkpointing is how to define the checkpoint interval. In this paper we propose modelling the relationship established between the processes of a parallel application through message exchange, in order to incorporate this relationship into current checkpoint interval models. The experimental evaluation shows that our checkpoint interval model, based on the definition of the parallel application inter-process dependency factor, is effective for calculating the checkpoint interval of parallel applications. Our results demonstrate that the overhead prediction error is smaller than 4% relative to the actual application execution.
Keywords :
checkpointing; parallel processing; software fault tolerance; checkpoint interval model; fault tolerance protocol; parallel application interprocess dependency factor; parallel computer fault frequency; Computational modeling; Equations; Fault tolerance; Fault tolerant systems; Mathematical model; Protocols; checkpoint interval; model; MPI; parallel applications
Conference_Title :
Distributed Computing Systems (ICDCS), 2011 31st International Conference on
Conference_Location :
Minneapolis, MN
Print_ISBN :
978-1-61284-384-1
ISSN :
1063-6927
DOI :
10.1109/ICDCS.2011.12