Title :
Enabling the discovery of recurring anomalies in aerospace problem reports using high-dimensional clustering techniques
Author :
Srivastava, A.N.
Author_Institution :
NASA Ames Res. Center, Moffett Field, CA
Abstract :
This paper describes the results of a significant research and development effort conducted at NASA Ames Research Center to develop new text mining algorithms to discover anomalies in free-text reports regarding system health and safety of two aerospace systems. We discuss two problems of significant import in the aviation industry. The first problem is that of automatic anomaly discovery concerning an aerospace system through the analysis of tens of thousands of free-text problem reports that are written about the system. The second problem that we address is that of automatic discovery of recurring anomalies, i.e., anomalies that may be described in different ways by different authors, at varying times and under varying conditions, but that are truly about the same part of the system. The intent of recurring anomaly identification is to determine project or system weakness or high-risk issues. The discovery of recurring anomalies is a key goal in building safe, reliable, and cost-effective aerospace systems. We address the anomaly discovery problem on thousands of free-text reports using two strategies: (1) as an unsupervised learning problem where an algorithm takes free-text reports as input and automatically groups them into different bins, where each bin corresponds to a different unknown anomaly category; and (2) as a supervised learning problem where the algorithm classifies the free-text reports into one of a number of known anomaly categories. We then discuss the application of these methods to the problem of discovering recurring anomalies. In fact, because recurring anomalies tend to have very small cluster sizes, we explore new methods and measures to enhance the original approach for anomaly detection. We present our results on the identification of recurring anomalies in problem reports concerning two aerospace systems as well as benchmark data sets that are widely used in the field of text mining. 
The first system is the Aviation Safety Reporting System (ASRS) database, which contains several hundred thousand free-text reports filed by commercial pilots concerning safety issues on commercial airlines. The second is the set of NASA Space Shuttle problem reports represented in the CARS data set, which consists of 7440 reports. We show significant classification accuracies on both of these systems and compare our results with reports classified into anomaly categories by field experts.
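The unsupervised strategy in the abstract (grouping free-text reports into bins without known categories) can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes TF-IDF weighting, cosine similarity, and a simple greedy "leader" grouping rule chosen only for illustration. The function names, the threshold value, and the example reports are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and keep purely alphabetic tokens (a deliberately crude tokenizer).
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (dict: term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF so every weight stays positive.
        vectors.append({w: c * (math.log((1 + n) / (1 + df[w])) + 1.0)
                        for w, c in tf.items()})
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def group_reports(docs, threshold):
    # Greedy leader clustering (an assumption, not the paper's method):
    # each report joins the first group whose leader is similar enough,
    # otherwise it starts a new group.
    vectors = tfidf_vectors(docs)
    groups = []  # each group is a list of document indices; groups[k][0] is the leader
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Hypothetical reports: two describe the same hydraulic issue in different words.
reports = [
    "hydraulic pump leak found during inspection",
    "leak found in hydraulic pump after flight",
    "cockpit display flicker reported by crew",
    "crew reported cockpit display flicker on approach",
]
groups = group_reports(reports, threshold=0.3)
print(groups)
```

Small groups of reports that describe the same fault in different words are the kind of candidate a recurring-anomaly search would surface for expert review; a realistic system would need a far more robust text representation and clustering method than this sketch.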
Keywords :
data mining; database management systems; space research; aerospace problem reports; aerospace systems; anomaly discovery problem; aviation industry; high dimensional clustering; recurring anomalies identification; supervised learning problem; text mining algorithms; unsupervised learning problem; Aerospace industry; Aerospace safety; Air safety; Clustering algorithms; Health and safety; NASA; Research and development; Supervised learning; Text mining; Unsupervised learning;
Conference_Title :
Aerospace Conference, 2006 IEEE
Conference_Location :
Big Sky, MT
Print_ISBN :
0-7803-9545-X
DOI :
10.1109/AERO.2006.1656136