Title :
Fault tolerant linear algebra: Recovering from fail-stop failures without checkpointing
Author :
Davies, Teresa ; Chen, Zizhong
Author_Institution :
Colorado Sch. of Mines, Golden, CO, USA
Abstract :
Today's long-running high-performance computing applications typically tolerate fail-stop failures by checkpointing. Although checkpointing is a very general technique that applies to a wide range of applications, it often introduces considerable overhead, especially when an application modifies a large amount of memory between checkpoints. In this research, we will design highly scalable, low-overhead fault-tolerant schemes tailored to the specific characteristics of an application. We will focus on linear algebra operations and redesign selected algorithms to tolerate fail-stop failures without checkpointing. We will also incorporate the resulting techniques into ScaLAPACK, the widely used numerical linear algebra library.
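The abstract does not say how recovery without checkpointing is achieved. One well-known approach for linear algebra is checksum encoding, and the sketch below assumes that approach purely for illustration. It simulates three processes that each own a row block of a matrix, plus one extra process that owns the sum of those blocks. Because matrix-vector multiplication is linear, the checksum relation still holds after the computation, so the result lost by a failed process can be rebuilt from the survivors. All function names and the data layout are hypothetical and are not taken from the paper or from ScaLAPACK.

```python
# Hypothetical sketch of checksum-based recovery (an assumed technique,
# not confirmed by the abstract). Processes are simulated as list entries.

def matvec(block, x):
    """Multiply one row block (a list of rows) by the vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in block]

def add_blocks(blocks):
    """Element-wise sum of equally shaped row blocks."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*blocks)]

def recover(results, checksum_result, failed):
    """Rebuild the failed process's result from the checksum and the survivors."""
    survivors = [r for i, r in enumerate(results) if i != failed]
    return [c - sum(vals) for c, vals in zip(checksum_result, zip(*survivors))]

# Three simulated processes, each owning a 2x2 row block of A.
blocks = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]
x = [2, 1]

# One extra process holds the sum of all blocks (the checksum block).
checksum_block = add_blocks(blocks)

# Every process, including the checksum process, does its local multiply.
results = [matvec(b, x) for b in blocks]
checksum_result = matvec(checksum_block, x)

# Simulate a fail-stop failure: process 1 stops and its result is lost.
failed = 1
lost = results[failed]
results[failed] = None

# Recovery needs no checkpoint, only the checksum and the surviving results.
recovered = recover(results, checksum_result, failed)
print(recovered == lost)
```

In this sketch the redundancy costs one extra process and no periodic writes to stable storage, which is the kind of trade-off the abstract contrasts with checkpointing. A single checksum tolerates only one failure at a time.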
Keywords :
fault tolerant computing; linear algebra; software libraries; software packages; ScaLAPACK; fail-stop failure; fault tolerance; high performance computing; library package; numerical linear algebra; Checkpointing; Degradation; Libraries; Packaging; Redundancy
Conference_Title :
Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on
Conference_Location :
Atlanta, GA
Print_ISBN :
978-1-4244-6533-0
DOI :
10.1109/IPDPSW.2010.5470775