Regional Information Center for Science and Technology - Fault-tolerant high-performance matrix multiplication: theory and practice

DocumentCode :

3349128

Title :

Fault-tolerant high-performance matrix multiplication: theory and practice

Author :

Gunnels, John A. ; Katz, Daniel S. ; Quintana-Ortí, Enrique S. ; Van de Geijn, R.A.

Author_Institution :

Dept. of Comput. Sci., Texas Univ., Austin, TX, USA

fYear :

2001

fDate :

1-4 July 2001

Firstpage :

Lastpage :

Abstract :

We extend the theory and practice regarding algorithmic fault-tolerant matrix-matrix multiplication, C=AB, in a number of ways. First, we propose low-overhead methods for detecting errors introduced not only in C but also in A and/or B. Second, we show that, theoretically, these methods will detect all errors as long as only one entry is corrupted. Third, we propose a low-overhead roll-back approach to correct errors once detected. Finally, we give a high-performance implementation of matrix-matrix multiplication that incorporates these error detection and correction methods. Empirical results demonstrate that these methods work well in practice while imposing an acceptable level of overhead relative to high-performance implementations without fault-tolerance.
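As a rough illustration of the kind of check the abstract refers to, the sketch below uses a generic checksum test: for a vector w of ones, C w should equal A (B w), and the right-hand side costs only two matrix-vector products. This is not the authors' implementation. The function names (`checksum_mismatch`, `multiply_with_rollback`), the tolerance, and the `corrupt` fault-injection parameter are assumptions made for this example, and the roll-back here simply recomputes the whole product, which is a simplification of whatever low-overhead scheme the paper describes.

```python
def matmul(A, B):
    """Plain triple-loop product of two list-of-lists matrices."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matvec(M, x):
    """Matrix-vector product M x."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in M]


def checksum_mismatch(A, B, C, tol=1e-9):
    """Return the row indices i where (C w)_i differs from (A (B w))_i.

    w is the all-ones vector, so (C w)_i is the sum of row i of C.
    The tolerance is an assumed value for illustration only.
    """
    w = [1.0] * len(B[0])
    lhs = matvec(C, w)             # row sums of the computed product
    rhs = matvec(A, matvec(B, w))  # same quantity from two cheap mat-vecs
    return [i for i, (l, r) in enumerate(zip(lhs, rhs))
            if abs(l - r) > tol * max(1.0, abs(r))]


def multiply_with_rollback(A, B, corrupt=None, max_retries=2):
    """Compute C = A B, check it, and recompute if the check fails.

    corrupt is a hypothetical fault-injection hook: (i, j, delta) adds
    delta to C[i][j] on the first attempt only, to simulate a soft error.
    Returns (C, number_of_retries_used).
    """
    for attempt in range(max_retries + 1):
        C = matmul(A, B)
        if corrupt is not None and attempt == 0:
            i, j, delta = corrupt
            C[i][j] += delta
        if not checksum_mismatch(A, B, C):
            return C, attempt
    raise RuntimeError("checksum still failing after retries")


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C, retries = multiply_with_rollback(A, B, corrupt=(1, 0, 1.0))
    print(C, retries)
```

A single corrupted entry of C changes exactly one row sum, so this test flags the affected row. Locating the column as well would need a second check from the left (a row vector times C), which the sketch omits.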

Keywords :

error analysis; fault tolerant computing; matrix multiplication; error correction; error detection; errors; fault-tolerance; fault-tolerant high-performance matrix multiplication; low-overhead roll-back approach; matrix-matrix multiplication; Contracts; Costs; Error correction; Fault tolerance; High performance computing; Laboratories; Linear algebra; NASA; Propulsion; Space technology;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Dependable Systems and Networks, 2001. DSN 2001. International Conference on

Conference_Location :

Goteborg, Sweden

Print_ISBN :

0-7695-1101-5

Type :

conf

DOI :

10.1109/DSN.2001.941390

Filename :

941390

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3349128