DocumentCode :
580093
Title :
Optimal real number codes for fault tolerant matrix operations
Author :
Zizhong Chen
Author_Institution :
Colorado Sch. of Mines, Golden, CO, USA
fYear :
2009
fDate :
14-20 Nov. 2009
Firstpage :
1
Lastpage :
10
Abstract :
It has been demonstrated recently that a single fail-stop process failure in ScaLAPACK matrix multiplication can be tolerated without checkpointing. Multiple simultaneous processor failures can be tolerated without checkpointing by encoding matrices using a real-number erasure correcting code. However, the floating-point representation of a real number in today's high performance computer architectures introduces round-off errors, which can be amplified and cause the loss of possibly all significant digits during recovery when the number of processors in the system is large. In this paper, we present a class of Reed-Solomon style real-number erasure correcting codes which have optimal numerical stability during recovery. We analytically construct the numerically best erasure correcting codes for 2 erasures and develop an approximation method to computationally construct numerically good codes for 3 or more erasures. Experimental results demonstrate that the proposed codes are numerically much more stable than existing codes.
Keywords :
Reed-Solomon codes; approximation theory; checkpointing; fault tolerance; matrix multiplication; multiprocessing systems; numerical stability; parallel architectures; Reed-Solomon style real-number erasure correcting code; ScaLAPACK matrix multiplication; approximation method; fault tolerant matrix operation; floating-point representation; high performance computer architecture; matrix encoding; optimal numerical stability; optimal real number code; precision loss; processor failure; single fail-stop process failure; system recovery
fLanguage :
English
Publisher :
IEEE
Conference_Title :
High Performance Computing Networking, Storage and Analysis, Proceedings of the Conference on
Conference_Location :
Portland, OR
Type :
conf
DOI :
10.1145/1654059.1654089
Filename :
6375541