DocumentCode :
2441185
Title :
Algorithmic Cholesky factorization fault recovery
Author :
Hakkarinen, Doug ; Chen, Zizhong
Author_Institution :
Dept. of Math. & Comput. Sci., Colorado Sch. of Mines, Golden, CO, USA
fYear :
2010
fDate :
19-23 April 2010
Firstpage :
1
Lastpage :
10
Abstract :
Modeling and analysis of large-scale scientific systems often use linear least squares regression, frequently employing Cholesky factorization to solve the resulting set of linear equations. For large matrices, this is often performed on high-performance clusters containing many processors. Assuming a constant failure rate per processor, the probability of a failure occurring during execution increases linearly with the number of processors. Fault-tolerant methods attempt to reduce the expected execution time by allowing recovery from failure. This paper presents an analysis and implementation of a fault-tolerant Cholesky factorization algorithm that does not require checkpointing to recover from fail-stop failures. Instead, the algorithm uses redundant data stored on an additional set of processors. It differs from previous work on algorithmic methods in that it addresses fail-stop failures rather than fail-continue cases. The implementation and experiments using ScaLAPACK demonstrate that the overhead of this method decreases relative to overall runtime as the matrix size increases, and the method thus shows promise for reducing the expected runtime of Cholesky factorizations on very large matrices.
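The checkpoint-free idea in the abstract can be illustrated with a small checksum example. The sketch below is not the paper's ScaLAPACK implementation: it is a serial, pure-Python illustration under the assumption that the redundant data is a single checksum row (the column sums of the matrix), and the function names `cholesky_with_checksum` and `recover_row` are hypothetical. It relies on the identity that if A = L L^T and c = e^T A, then processing c like any other row of the factorization yields e^T L, so one lost row of L can be rebuilt from the checksum and the surviving rows.

```python
import math


def cholesky_with_checksum(a):
    """Factor a symmetric positive definite matrix a = L L^T.

    Also carries a checksum row through the factorization, so that the
    returned row c equals the column sums of L (e^T L).
    """
    n = len(a)
    # Checksum row of the input: column sums of A (e^T A).
    chk = [sum(a[i][j] for i in range(n)) for j in range(n)]
    L = [[0.0] * n for _ in range(n)]
    c = [0.0] * n
    for j in range(n):
        # Diagonal entry of column j.
        d = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
        # The checksum row is updated with the same formula as any other row.
        s = chk[j] - sum(c[k] * L[j][k] for k in range(j))
        c[j] = s / L[j][j]
    return L, c


def recover_row(L, c, lost):
    """Rebuild row `lost` of L from the checksum row and the surviving rows."""
    n = len(L)
    return [c[j] - sum(L[i][j] for i in range(n) if i != lost)
            for j in range(n)]


# Small example: simulate losing row 1 of L and rebuilding it.
A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L, c = cholesky_with_checksum(A)
print(c)                     # [4.0, 3.0, 2.0]
print(recover_row(L, c, 1))  # [1.0, 2.0, 0.0]
```

In a distributed setting the rows would live on different processors and the checksum on an additional one, which is the role the abstract assigns to the "additional set of processors"; how the paper distributes and maintains that data during the factorization is not described in this record.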
Keywords :
fault tolerant computing; least squares approximations; matrix decomposition; regression analysis; system recovery; ScaLAPACK; algorithmic Cholesky factorization fault recovery; fail-stop failures; fault tolerant method; high performance clusters; large scale scientific systems; linear equations; linear least squares regression; Collaboration; Computer architecture; Computer networks; Delay; Distributed computing; Grid computing; Network topology; Processor scheduling; Routing; Switches; Algorithmic Based Fault Tolerance; Checkpoint Free; Linear Algebra;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on
Conference_Location :
Atlanta, GA
ISSN :
1530-2075
Print_ISBN :
978-1-4244-6442-5
Type :
conf
DOI :
10.1109/IPDPS.2010.5470436
Filename :
5470436