DocumentCode
2448241
Title
Fault tolerant linear algebra: Recovering from fail-stop failures without checkpointing
Author
Davies, Teresa ; Chen, Zizhong
Author_Institution
Colorado Sch. of Mines, Golden, CO, USA
fYear
2010
fDate
19-23 April 2010
Firstpage
1
Lastpage
4
Abstract
Today's long-running high-performance computing applications typically tolerate fail-stop failures by checkpointing. While checkpointing is a very general technique that can be applied to a wide range of applications, it often introduces considerable overhead, especially when applications modify a large amount of memory between checkpoints. In this research, we will design highly scalable, low-overhead fault-tolerant schemes tailored to the specific characteristics of an application. We will focus on linear algebra operations and redesign selected algorithms to tolerate fail-stop failures without checkpointing. We will also incorporate the developed techniques into the widely used numerical linear algebra library ScaLAPACK.
Keywords
fault tolerant computing; linear algebra; software libraries; software packages; ScaLAPACK; fail-stop failure; fault tolerance; high performance computing; library package; numerical linear algebra; Checkpointing; Degradation; Libraries; Packaging; Redundancy
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on
Conference_Location
Atlanta, GA
Print_ISBN
978-1-4244-6533-0
Type
conf
DOI
10.1109/IPDPSW.2010.5470775
Filename
5470775