Title :
Mining deviants in time series data streams
Author :
Muthukrishnan, S. ; Shah, Rahul ; Vitter, Jeffrey Scott
Author_Institution :
Rutgers Univ., Piscataway, NJ, USA
Abstract :
One of the central tasks in managing, monitoring and mining data streams is identifying outliers. There is a long history of study of various kinds of outliers in statistics and databases, and a recent focus on mining outliers in data streams. Here, we adopt the notion of "deviants" from Jagadish et al. (1999) as outliers. Deviants are based on one of the most fundamental statistical concepts, the standard deviation (or variance). Formally, deviants are defined in terms of a representation sparsity metric: they are values whose removal from the dataset leads to an improved compressed representation of the remaining items. Thus, deviants are not global maxima or minima but local aberrations. Deviants are known to be of great mining value in time series databases. We present the first known algorithms for identifying deviants on massive data streams. Our algorithms monitor streams using very small space (polylogarithmic in the data size) and can quickly find deviants at any instant as the data stream evolves over time. For all versions of this problem (univariate vs. multivariate time series; optimal vs. near-optimal vs. heuristic solutions; offline vs. streaming), our algorithms share the same framework: they maintain a hierarchical set of candidate deviants that are updated as the time series data is progressively revealed. We show experimentally, using real network traffic data (SNMP aggregate time series) as well as synthetic data, that our algorithms are remarkably accurate in determining the deviants.
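The definition of a deviant in the abstract can be illustrated with a small offline sketch. It is not the streaming algorithm of the paper: it assumes the whole series is summarized by a single bucket (its mean), measures representation quality as the sum of squared errors around that mean, and tries every set of k removals by brute force. The function names `sse` and `brute_force_deviants` are hypothetical and chosen only for this illustration.

```python
from itertools import combinations


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def brute_force_deviants(series, k):
    """Return (indices, remaining_sse) for the k points whose removal
    minimizes the squared error of the remaining points.

    Assumption: the series is represented by one bucket (its mean).
    This exhaustive search is exponential in k and is only meant to
    make the definition concrete on tiny inputs.
    """
    best_idx, best_err = None, float("inf")
    for idx in combinations(range(len(series)), k):
        removed = set(idx)
        rest = [v for i, v in enumerate(series) if i not in removed]
        err = sse(rest)
        if err < best_err:
            best_idx, best_err = idx, err
    return best_idx, best_err


series = [10, 11, 10, 50, 11, 10, 9, 10]
deviant_indices, remaining_error = brute_force_deviants(series, k=1)
print(deviant_indices, remaining_error, sse(series))
```

On this toy series the single deviant is the value 50 at index 3, since removing it leaves the smallest squared error. With a single bucket the deviant coincides with the global extreme; with a multi-bucket representation each value would instead be judged against its own bucket, which is one way to read the abstract's remark that deviants are local aberrations.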
Keywords :
computational complexity; data analysis; data mining; statistical databases; temporal databases; time series; SNMP; compressed representation; data management; data monitoring; deviant mining; heuristic solution; local aberration; multivariate time series; near-optimal solutions; network traffic data; outlier identification; polylogarithmic data size; representation sparsity metric; standard deviation; statistics; time series data streams; time series databases; univariate time series; variance; Aggregates; Data mining; Databases; History; Internet telephony; Monitoring; Road transportation; Statistics; Telecommunication traffic; Temperature
Conference_Title :
Scientific and Statistical Database Management, 2004. Proceedings. 16th International Conference on
Print_ISBN :
0-7695-2146-0
DOI :
10.1109/SSDM.2004.1311192