Title :
Detecting anomalies for high performance computing resilience
Author :
Dueño, Shamir J Quiñones ; Sáez, Emmanuel Avilés ; Bonaparte, Yael M Camacho ; Kettani, Houssain
Author_Institution :
Electr. & Comput. Eng. & Comput. Sci. Dept., Polytech. Univ. of Puerto Rico, San Juan, PR, USA
Abstract :
Supercomputers are increasingly used by scientists and engineers to run data-intensive applications. System restore points (checkpoints) are used to restore a process to a previously saved state if a system component fails. When failures become too frequent, it is no longer feasible to make application progress with checkpoints. Proactive measures that migrate application components to healthy nodes can increase the time between failures and enable application progress. However, proactive measures require timely system information and an ability to predict where failures are likely to occur. This project uses data collected from system nodes to identify anomalous node behavior. Detecting anomalies is the first step toward identifying failures and eventually developing a failure prediction capability. The main result of this project is a set of analysis tools for anomaly identification, built on the R open-source software environment for statistical computing and graphics and on the GGobi open-source visualization program for exploring high-dimensional data. The tools do not assume any specific set of system attributes. Given a large collection of system attributes recorded at regular time intervals, the tools use only those attributes that contain information. Treating the informative attributes as a high-dimensional space, the tools identify anomalies and automatically find the attributes that are most responsible for the identified anomalies. Further exploration of the anomalies and their attributes is enabled by the GGobi visualization program.
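The pipeline described in the abstract (keep only informative attributes, flag anomalous nodes, name the attribute most responsible) can be illustrated with a minimal sketch. The abstract does not state which statistical method the R tools use, so everything below is an assumption made for illustration: an attribute counts as "informative" if it takes more than one value, anomalies are scored by a root-mean-square z-score, and the function names `informative_attributes` and `find_anomalies`, the threshold, and the sample data are all hypothetical. Python is used here in place of R.

```python
import math
from statistics import mean, pstdev


def informative_attributes(rows):
    """Return indices of attributes (columns) that take more than one value.

    Assumption: a constant attribute carries no information and is dropped.
    """
    n_cols = len(rows[0])
    return [j for j in range(n_cols) if len({r[j] for r in rows}) > 1]


def find_anomalies(rows, threshold=2.0):
    """Flag rows whose RMS z-score over informative attributes exceeds threshold.

    Returns a list of (row_index, score, most_responsible_attribute_index).
    The scoring rule is a stand-in, not the method used by the paper's tools.
    """
    cols = informative_attributes(rows)
    if not cols:
        return []

    # Per-attribute mean and population standard deviation.
    stats = {}
    for j in cols:
        values = [r[j] for r in rows]
        stats[j] = (mean(values), pstdev(values))

    anomalies = []
    for i, row in enumerate(rows):
        z = {j: (row[j] - stats[j][0]) / stats[j][1] for j in cols}
        score = math.sqrt(sum(v * v for v in z.values()) / len(z))
        if score > threshold:
            # The attribute with the largest absolute z-score is reported
            # as the one most responsible for the anomaly.
            culprit = max(z, key=lambda j: abs(z[j]))
            anomalies.append((i, score, culprit))
    return anomalies


if __name__ == "__main__":
    # Hypothetical samples: [core_count, temperature, load] for 20 nodes.
    nodes = [[4, 50 + (i % 3), 1.0 + 0.1 * (i % 2)] for i in range(20)]
    nodes[7][1] = 95  # inject one overheating node
    for index, score, attr in find_anomalies(nodes):
        print(f"node {index}: score={score:.2f}, attribute={attr}")
```

In this made-up data set the constant core count is discarded, node 7 is the only node flagged, and attribute 1 (temperature) is reported as the most responsible. A real deployment would need a scoring rule that is robust to several simultaneous outliers, which this sketch is not.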
Keywords :
data visualisation; failure analysis; mainframes; program visualisation; system recovery; GGobi open source visualization program; R open source software environment; anomalies detection; failure prediction capability; high performance computing resilience; statistical computing; supercomputers; system restore points; Application software; Computer displays; Couplings; Data visualization; Graphics; High performance computing; Merging; Open source software; Resilience; Time measurement; Failures; High Performance; Nodes;
Conference_Title :
Computer and Automation Engineering (ICCAE), 2010 The 2nd International Conference on
Conference_Location :
Singapore
Print_ISBN :
978-1-4244-5585-0
Electronic_ISBN :
978-1-4244-5586-7
DOI :
10.1109/ICCAE.2010.5451794