DocumentCode :
3678420
Title :
Understanding the Propagation of Error Due to a Silent Data Corruption in a Sparse Matrix Vector Multiply
Author :
Jon Calhoun;Marc Snir;Luke Olson;Maria Garzaran
Author_Institution :
Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
fYear :
2015
Firstpage :
541
Lastpage :
542
Abstract :
With the rate of errors that silently affect an application's state or output expected to increase in future HPC machines, numerous mitigation schemes have been proposed, but little work has investigated why these schemes detect some errors while others are masked. This paper investigates how silent data corruption (SDC) propagates through a sparse matrix vector multiply (SpMV), a fundamental HPC computation kernel. We discover that analyzing the mathematics of the SpMV alone gives a limited understanding of SDC propagation. We achieve a more complete understanding by investigating how SDC propagates in an SpMV as it is expressed in machine instructions.
Keywords :
"Sparse matrices","Iterative methods","Kernel","Random access memory","Electric breakdown","Conferences"
Publisher :
IEEE
Conference_Title :
Cluster Computing (CLUSTER), 2015 IEEE International Conference on
Type :
conf
DOI :
10.1109/CLUSTER.2015.101
Filename :
7307650