DocumentCode
2164610
Title
Detecting anomalies for high performance computing resilience
Author
Dueño, Shamir J Quiñones ; Sáez, Emmanuel Avilés ; Bonaparte, Yael M Camacho ; Kettani, Houssain
Author_Institution
Electr. & Comput. Eng. & Comput. Sci. Dept., Polytech. Univ. of Puerto Rico, San Juan, PR, USA
Volume
4
fYear
2010
fDate
26-28 Feb. 2010
Firstpage
5
Lastpage
7
Abstract
Supercomputers are being used increasingly by scientists and engineers to process data-intensive applications. System restore points (checkpoints) are used to restore a process to a previously saved state if a system component fails. When failures become too frequent, it is no longer feasible to make application progress with checkpoints. Proactive measures that migrate application components to healthy nodes can increase time between failures and enable application progress. However, proactive measures require timely system information and an ability to predict where failures are likely to occur. This project uses data collected from system nodes to identify anomalous node behavior. Detecting anomalies is the first step to identifying failures and eventually developing a failure prediction capability. The main result of this project is a number of analysis tools for anomaly identification that are based on the R open source software environment for statistical computing and graphics and on the GGobi open source visualization program for exploring high-dimensional data. The tools do not assume any specific set of system attributes. Given a large collection of system attributes recorded at some time intervals, the tools use only those attributes that contain information. Considering the informative attributes in a high-dimensional space, the tools identify anomalies and automatically find the attributes that are most responsible for the identified anomalies. Further exploration of the anomalies and their attributes is enabled by the GGobi visualization program.
Keywords
data visualisation; failure analysis; mainframes; program visualisation; system recovery; GGobi open source visualization program; R open source software environment; anomalies detection; failure prediction capability; high performance computing resilience; statistical computing; supercomputers; system restore points; Application software; Computer displays; Couplings; Data visualization; Graphics; High performance computing; Merging; Open source software; Resilience; Time measurement; Failures; High Performance; Nodes;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer and Automation Engineering (ICCAE), 2010 The 2nd International Conference on
Conference_Location
Singapore
Print_ISBN
978-1-4244-5585-0
Electronic_ISBN
978-1-4244-5586-7
Type
conf
DOI
10.1109/ICCAE.2010.5451794
Filename
5451794