Title :
Matrix Multiplication on GPUs with On-Line Fault Tolerance
Author :
Ding, Chong ; Karlsson, Christer ; Liu, Hui ; Davies, Teresa ; Chen, Zizhong
Abstract :
Commercial graphics processing units (GPUs) have proven attractive and inexpensive for high-performance scientific applications. However, a recent study conducted through Folding@home shows that two-thirds of the GPUs tested exhibit a detectable, pattern-sensitive rate of memory soft errors in GPGPU use. Fault tolerance is therefore viewed as critical to the effective use of these GPUs. In this paper, we present an on-line GPU error detection, location, and correction method that incorporates fault tolerance into matrix multiplication. The main contribution of the paper is to extend traditional algorithm-based fault tolerance (ABFT) from off-line to on-line operation and to apply it to matrix multiplication on GPUs. The proposed on-line fault tolerance mechanism detects soft errors in the middle of the computation, so that better reliability can be achieved by correcting corrupted computations in a timely manner. Experimental results demonstrate that the proposed method is highly efficient.
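Illustrative sketch (not taken from the paper): the abstract does not give the algorithm, so the following is only a minimal CPU-side Python illustration of how checksum-based ABFT might be checked during a matrix multiplication rather than only at the end. It assumes the classic row/column checksum encoding and an outer-product formulation in which the checksum relationship holds after every accumulation step. All function names, the `check_every` parameter, and the `inject` fault-injection hook are hypothetical, and the sketch handles only a single corrupted data entry between two checks.

```python
def add_checksums(a, b):
    """Append a column-sum row to A and a row-sum column to B."""
    a_c = [row[:] for row in a] + [[sum(col) for col in zip(*a)]]
    b_r = [row[:] + [sum(row)] for row in b]
    return a_c, b_r


def find_error(c_f, tol=1e-9):
    """Return (i, j) of a single corrupted data entry, or None if consistent."""
    n, m = len(c_f) - 1, len(c_f[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(c_f[i][:m]) - c_f[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_f[i][j] for i in range(n)) - c_f[n][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern not correctable by this sketch")


def correct(c_f, i, j):
    """Rebuild entry (i, j) from its row checksum and the other row entries."""
    m = len(c_f[0]) - 1
    c_f[i][j] = c_f[i][m] - sum(c_f[i][k] for k in range(m) if k != j)


def online_abft_matmul(a, b, check_every=1, inject=None):
    """Multiply A and B as a sum of outer products, verifying checksums
    every `check_every` steps. `inject` = (step, i, j, delta) simulates a
    soft error (hypothetical hook, for demonstration only)."""
    a_c, b_r = add_checksums(a, b)
    rows, cols, inner = len(a_c), len(b_r[0]), len(b)
    c_f = [[0.0] * cols for _ in range(rows)]
    corrected = []
    for s in range(inner):
        # Rank-1 update: column s of A (with checksum) times row s of B.
        for i in range(rows):
            for j in range(cols):
                c_f[i][j] += a_c[i][s] * b_r[s][j]
        if inject is not None and inject[0] == s:
            _, i, j, delta = inject
            c_f[i][j] += delta
        # The checksum invariant holds on partial sums, so it can be
        # verified mid-computation instead of only on the final result.
        if (s + 1) % check_every == 0 or s == inner - 1:
            loc = find_error(c_f)
            if loc is not None:
                correct(c_f, *loc)
                corrected.append((s, loc))
    # Strip the checksum row and column to recover the plain product.
    return [row[:-1] for row in c_f[:-1]], corrected
```

With floating-point data the fixed `tol` would need to be replaced by a threshold scaled to the expected round-off, and a GPU version would perform the updates and checks in kernels; both are outside the scope of this sketch.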
Keywords :
coprocessors; fault tolerant computing; matrix multiplication; Folding@home; GPGPU; algorithm-based fault tolerance; graphics processing units; high performance scientific application; memory soft errors; on-line GPU error correction; on-line GPU error detection; on-line GPU error location; on-line fault tolerance; soft error detection; Fault tolerance; Fault tolerant systems; Graphics processing unit; Hardware; Random access memory; Tunneling magnetoresistance; GPUs; Soft Errors
Conference_Title :
Parallel and Distributed Processing with Applications (ISPA), 2011 IEEE 9th International Symposium on
Conference_Location :
Busan
Print_ISBN :
978-1-4577-0391-1
Electronic_ISBN :
978-0-7695-4428-1
DOI :
10.1109/ISPA.2011.50