Title :
Fault injection experiment results in space borne parallel application programs
Author :
Some, Raphael R. ; Kim, Won S. ; Khanoyan, Garen ; Callum, Leslie ; Agrawal, Anil ; Beahan, John J. ; Shamilian, Arshaluys ; Nikora, Allen
Author_Institution :
Jet Propulsion Lab., California Inst. of Technol., Pasadena, CA, USA
Abstract :
Development of the REE Commercial-Off-The-Shelf (COTS) based space-borne supercomputer requires detailed knowledge of system behavior in the presence of Single Event Upset (SEU) induced faults. When combined with a hardware radiation fault model and mission environment data in a medium-grained system model, experimentally obtained fault behavior data can be used to predict system reliability, availability, and performance; determine optimal fault detection methods and boundaries; and define high-ROI fault tolerance strategies. The REE project has developed a fault injection tool suite and a methodology for experimentally determining system behavior statistics in the presence of application-level, SEU-induced transient faults. Initial characterization of science data application code for an autonomous Mars Rover geology application indicates that this code is relatively insensitive to SEUs and can therefore be made highly immune to application-level faults with relatively low-overhead strategies.
Keywords :
aerospace computing; fault tolerant computing; parallel machines; parallel programming; radiation effects; software performance evaluation; software reliability; REE COTS based space-borne supercomputer; SEU induced faults; application level SEU induced transient faults; autonomous Mars Rover geology application; fault behavior data; fault injection tool suite; fault tolerance strategies; hardware radiation fault model; medium grained system model; mission environment data; optimal fault detection methods; single event upsets; space borne parallel application programs; system availability; system behavior statistics; system performance; system reliability; Availability; Fault detection; Fault tolerant systems; Hardware; Mars; Predictive models; Reliability; Single event upset; Statistics; Supercomputers
Conference_Title :
Aerospace Conference Proceedings, 2002. IEEE
Print_ISBN :
0-7803-7231-X
DOI :
10.1109/AERO.2002.1035379