Title :
Does partial replication pay off?
Author :
Stearley, Jon ; Ferreira, Kurt ; Robinson, David ; Laros, Jim ; Pedretti, Kevin ; Arnold, Dorian ; Bridges, Patrick ; Riesen, Rolf
Abstract :
As part counts in high performance computing systems are projected to increase faster than part reliabilities, there is increasing interest in enabling jobs to continue to execute in the presence of failures. Process replication has been shown to be a viable method to accomplish this, but previous studies have focused on full replication levels (dual, triple, etc.). In this work, we present a model for studying job interrupt times on systems of arbitrary replication degree and arbitrary node failure distribution. We show agreement of this model with a previously developed simulator and make three key observations for systems using process replication: 1) job interrupts are not exponentially distributed (even when underlying node failures are), 2) job mean time to interrupt increases exponentially between full replication degrees, and 3) while partial replication may pay off for interrupt-dominated jobs, full replication degrees offer the best overall value.
Keywords :
computer network reliability; failure analysis; mainframes; arbitrary node failure distribution; arbitrary replication degree; high performance computing system; job interrupt distribution; job interrupt time; process replication; Ash; Checkpointing; Mathematical model; Redundancy; Runtime; Sockets
Conference_Title :
Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on
Conference_Location :
Boston, MA
Print_ISBN :
978-1-4673-2264-5
Electronic_ISBN :
978-1-4673-2265-2
DOI :
10.1109/DSNW.2012.6264669