DocumentCode :
3596078
Title :
ACR: Automatic checkpoint/restart for soft and hard error protection
Author :
Ni, Xiang ; Meneses, Esteban ; Jain, Nikhil ; Kale, Laxmikant V.
Author_Institution :
Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
fYear :
2013
Firstpage :
1
Lastpage :
12
Abstract :
As machines increase in scale, many researchers have predicted that failure rates will correspondingly increase. Soft errors do not inhibit execution, but may silently generate incorrect results. Recent trends have shown that soft error rates are increasing, and hence they must be detected and handled to maintain correctness. We present a holistic methodology for automatically detecting and recovering from soft or hard faults with minimal application intervention. This is demonstrated by ACR: an automatic checkpoint/restart framework that performs application replication and automatically adapts the checkpoint period using online information about the current failure rate. ACR performs an application- and user-oblivious recovery. We empirically test ACR by injecting failures that follow different distributions for five applications and show low overhead when scaled to 131,072 cores. We also analyze the interaction between soft and hard errors and propose three recovery schemes that explore the trade-off between performance and reliability requirements.
Keywords :
checkpointing; error correction; fault tolerant computing; ACR; application-recovery; automatic checkpoint-restart framework; checkpoint period; failure rates; hard error protection; online information; soft error protection; user-oblivious recovery; Checkpointing; Computer crashes; Fault tolerant systems; Redundancy; Resilience; Fault-tolerance; checkpoint/restart; redundancy; silent data corruption
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing, Networking, Storage and Analysis (SC), 2013 International Conference for
Print_ISBN :
978-1-4503-2378-9
Type :
conf
DOI :
10.1145/2503210.2503266
Filename :
6877440