DocumentCode :
166664
Title :
Exploring void search for fault detection on extreme scale systems
Author :
Berrocal, Eduardo ; Li Yu ; Wallace, Sean ; Papka, Michael E. ; Zhiling Lan
Author_Institution :
Illinois Inst. of Technol., Chicago, IL, USA
fYear :
2014
fDate :
22-26 Sept. 2014
Firstpage :
1
Lastpage :
9
Abstract :
Mean Time Between Failures (MTBF), now calculated in days or hours, is expected to drop to minutes on exascale machines. The advancement of resilience technologies greatly depends on a deeper understanding of faults arising from hardware and software components. This understanding has the potential to help us build better fault tolerance technologies. For instance, it has been shown that combining checkpointing with failure prediction leads to longer checkpoint intervals, which in turn leads to fewer total checkpoints. In this paper we present a new approach for fault detection based on the Void Search (VS) algorithm. VS is used primarily in astrophysics for finding areas of space that have a very low density of galaxies. We evaluate our algorithm using real environmental logs from the Mira Blue Gene/Q supercomputer at Argonne National Laboratory. Our experiments show that our approach can detect almost all faults (i.e., sensitivity close to 1) with a low false positive rate (i.e., specificity values above 0.7). We also compare our algorithm with a number of existing detection algorithms, and find that ours outperforms all of them.
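The abstract reports results as sensitivity and specificity. As a minimal sketch of how these two metrics are conventionally computed from labeled detections, the Python snippet below uses the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)). The function name and the sample labels are hypothetical illustrations; they are not taken from the paper or from the Mira logs.

```python
import math


def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for two equal-length boolean sequences.

    actual[i]    -- True if interval i really contained a fault
    predicted[i] -- True if the detector flagged interval i
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # faults that were flagged
    fn = sum(1 for a, p in pairs if a and not p)      # faults that were missed
    tn = sum(1 for a, p in pairs if not a and not p)  # normal, not flagged
    fp = sum(1 for a, p in pairs if not a and p)      # normal, but flagged

    # Guard against empty classes; NaN signals "undefined" rather than 0.
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    specificity = tn / (tn + fp) if (tn + fp) else math.nan
    return sensitivity, specificity


# Hypothetical labels for ten monitoring intervals (illustration only).
actual = [True, True, False, False, False, False, True, False, False, False]
predicted = [True, True, True, False, False, False, True, False, True, False]

sens, spec = sensitivity_specificity(actual, predicted)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

In this made-up example every fault is flagged (sensitivity 1.0) while two of the seven normal intervals raise false alarms (specificity 5/7), which is the kind of trade-off the abstract describes. The false positive rate is 1 minus the specificity.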
Keywords :
checkpointing; fault tolerant computing; search problems; MTBF; Mira Blue Gene/Q supercomputer; VS algorithm; astrophysics; extreme scale systems; failure prediction; fault detection; fault tolerance technologies; galaxies; hardware component; mean time between failures; resilience technologies; software component; void search algorithm; Computers; Lead; Blue Gene/Q; Environmental Data; Fault Detection; Reliability; Void Search
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Cluster Computing (CLUSTER), 2014 IEEE International Conference on
Conference_Location :
Madrid
Type :
conf
DOI :
10.1109/CLUSTER.2014.6968757
Filename :
6968757