DocumentCode
2933281
Title
Does partial replication pay off?
Author
Stearley, Jon ; Ferreira, Kurt ; Robinson, David ; Laros, Jim ; Pedretti, Kevin ; Arnold, Dorian ; Bridges, Patrick ; Riesen, Rolf
fYear
2012
fDate
25-28 June 2012
Firstpage
1
Lastpage
6
Abstract
As part counts in high performance computing systems are projected to increase faster than part reliabilities, there is increasing interest in enabling jobs to continue to execute in the presence of failures. Process replication has been shown to be a viable method to accomplish this, but previous studies have focused on full replication levels (dual, triple, etc.). In this work, we present a model for studying job interrupt times on systems of arbitrary replication degree and arbitrary node failure distribution. We show agreement of this model with a previously developed simulator and make three key observations for systems using process replication: 1) job interrupts are not exponentially distributed (even when underlying node failures are), 2) job mean time to interrupt increases exponentially between full replication degrees, and 3) while partial replication may pay off for interrupt-dominated jobs, full replication degrees offer the best overall value.
Keywords
computer network reliability; failure analysis; mainframes; arbitrary node failure distribution; arbitrary replication degree; high performance computing system; job interrupt distribution; job interrupt time; process replication; Ash; Checkpointing; Mathematical model; Redundancy; Runtime; Sockets;
fLanguage
English
Publisher
IEEE
Conference_Titel
Dependable Systems and Networks Workshops (DSN-W), 2012 IEEE/IFIP 42nd International Conference on
Conference_Location
Boston, MA
Print_ISBN
978-1-4673-2264-5
Electronic_ISBN
978-1-4673-2265-2
Type
conf
DOI
10.1109/DSNW.2012.6264669
Filename
6264669