DocumentCode :
3146059
Title :
Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing
Author :
Fiala, David
Author_Institution :
Dept. of Comput. Sci., North Carolina State Univ., Raleigh, NC, USA
fYear :
2011
fDate :
16-20 May 2011
Firstpage :
2069
Lastpage :
2072
Abstract :
Faults have become the norm rather than the exception for high-end computing on clusters with 10s/100s of thousands of cores, and this situation will only become more dire as we reach exascale computing. Exacerbating this situation, some of these faults will not be detected, manifesting themselves as silent errors that will corrupt memory while applications continue to operate but report incorrect results. This paper introduces RedMPI, an MPI library residing in the profiling layer of any standards-compliant MPI implementation. RedMPI is capable of both online detection and correction of soft errors that occur in MPI applications without requiring code changes to application source code. By providing redundancy, RedMPI is capable of transparently detecting corrupt messages from MPI processes that become faulted during execution. Furthermore, with triple redundancy RedMPI "votes" out MPI messages of a faulted process by replacing corrupted results with corrected results from unfaulted processes. We present an evaluation of RedMPI on an assortment of applications to demonstrate the effectiveness and assess associated overheads. Fault injection experiments establish that RedMPI is not only capable of successfully detecting injected faults, but can also correct these faults while carrying a corrupted application to successful completion without propagating invalid data.
Keywords :
application program interfaces; data handling; fault diagnosis; message passing; parallel processing; MPI library; RedMPI; exascale computing; faulted process; large scale high performance computing; online detection; silent data corruption; soft error correction; standards compliant MPI implementation; Benchmark testing; Laboratories; Libraries; Protocols; Receivers; Redundancy; Software;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on
Conference_Location :
Shanghai
ISSN :
1530-2075
Print_ISBN :
978-1-61284-425-1
Electronic_ISBN :
1530-2075
Type :
conf
DOI :
10.1109/IPDPS.2011.379
Filename :
6009019