Title :
Anomaly detection in large-scale coalition clusters for dependability assurance
Author :
Guan, Qiang ; Smith, Derek ; Fu, Song
Author_Institution :
Dept. of Comput. Sci. & Eng., Univ. of North Texas, Denton, TX, USA
Abstract :
In large-scale high-performance computing systems, component failures are the norm rather than the exception. Failure occurrences and their impact on system performance and operation costs are an increasingly important concern for system designers and administrators. When a compute node fails to function properly, health-related data are valuable for troubleshooting. However, it is challenging to identify anomalies effectively in the large volume of noisy, high-dimensional data. Manual detection is time-consuming, error-prone, and does not scale well. In this paper, we present an autonomic mechanism for anomaly detection in coalition clusters. It comprises a set of techniques that facilitate automatic analysis of system health data. We first apply data transformation to put health data into a uniform format. Feature selection then chooses the principal variables, which reduces the data size. Clustering and outlier detection are then explored to identify nodes with anomalous behavior. We evaluate our prototype implementation on a production institution-wide computational grid. The results show that our mechanism can effectively detect faulty nodes with high accuracy and low computation overhead.
Keywords :
security of data; system monitoring; system recovery; anomaly detection; automatic analysis; autonomic mechanism; component failure; computational grid; data transformation; dependability assurance; failure occurrence; faulty node; feature selection; health-related data; high dimensional data; large-scale coalition cluster; large-scale high performance computing system; operation cost; outlier detection; system health data; system performance; Bayesian methods; Data models; Feature extraction; Joints; Monitoring; Mutual information; Temperature sensors; Anomaly detection; Autonomic systems; Coalition clusters; Compute grids; System dependability;
Conference_Title :
High Performance Computing (HiPC), 2010 International Conference on
Conference_Location :
Dona Paula
Print_ISBN :
978-1-4244-8518-5
Electronic_ISBN :
978-1-4244-8519-2
DOI :
10.1109/HIPC.2010.5713169