DocumentCode :
3696960
Title :
Exploiting Spatial Smoothness in HPC Applications to Detect Silent Data Corruption
Author :
Leonardo Bautista-Gomez;Franck Cappello
fYear :
2015
Firstpage :
128
Lastpage :
133
Abstract :
Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. This situation is pushing supercomputer constructors to the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect soft errors, a percentage of those errors pass unnoticed by the system. Such silent errors are extremely damaging because they can make applications produce wrong results. In this paper we propose a technique that leverages certain properties of HPC applications in order to detect silent errors at the application level. Our technique detects corruption solely based on the data behavior and is algorithm-agnostic. We show that this strategy can detect up to 90% of injected errors in some regions while incurring less than 1% overhead.
Keywords :
"Detectors","Supercomputers","Random access memory","Error correction codes","Entropy","Reliability","Registers"
Publisher :
IEEE
Conference_Title :
2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS)
Type :
conf
DOI :
10.1109/HPCC-CSS-ICESS.2015.9
Filename :
7336154