Finding Local Anomalies in Very High Dimensional Space

Author

De Vries, Timothy ; Chawla, Sanjay ; Houle, Michael E.

Author_Institution

Sch. of Inf. Technol., Univ. of Sydney, Sydney, NSW, Australia

fYear

2010

fDate

13-17 Dec. 2010

Firstpage

128

Lastpage

137

Abstract

Time, cost and energy efficiency are critical factors for many data analysis techniques when the size and dimensionality of the data are very large. We investigate the use of Local Outlier Factor (LOF) for data of this type, providing a motivating example from real-world data. We propose Projection-Indexed Nearest-Neighbours (PINN), a novel technique that exploits extended nearest-neighbour sets in a reduced-dimensional space to create an accurate approximation of k-nearest-neighbour distances, which are used as the core density measurement within LOF. The reduced dimensionality allows for efficient indexing that is sub-quadratic in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of Random Projection (RP) and PINN shows that we are able to preserve the density of the intrinsic manifold of the data set after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF for many high-dimensional real-world data sets of up to 300,000 elements and 102,600 dimensions.
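
The idea summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `pinn_knn` and `lof`, the parameters `proj_dim` and `c` (the factor by which the candidate set is extended), and the Gaussian projection matrix are all assumptions made for illustration. The sketch also searches the projected space by brute force, whereas the abstract describes a sub-quadratic index there.

```python
import numpy as np

def pinn_knn(X, k, proj_dim, c=3, seed=0):
    """Approximate k-NN: pick c*k candidates in a randomly projected space,
    then re-rank them by true distance in the original space.
    Assumes k < number of rows in X."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Gaussian random projection (an assumed choice of projection matrix).
    R = rng.normal(size=(d, proj_dim)) / np.sqrt(proj_dim)
    Y = X @ R
    m = min(c * k, n - 1)  # size of the extended neighbour set
    idx = np.empty((n, k), dtype=int)
    dist = np.empty((n, k))
    for i in range(n):
        # Brute-force search in the projected space; a spatial index
        # would replace this step in a sub-quadratic implementation.
        dy = np.linalg.norm(Y - Y[i], axis=1)
        dy[i] = np.inf  # exclude the point itself
        cand = np.argpartition(dy, m - 1)[:m]
        # Re-rank the candidates using distances in the original space.
        dx = np.linalg.norm(X[cand] - X[i], axis=1)
        order = np.argsort(dx)[:k]
        idx[i] = cand[order]
        dist[i] = dx[order]
    return idx, dist

def lof(idx, dist):
    """Local Outlier Factor computed from (approximate) k-NN lists."""
    k_dist = dist[:, -1]                    # k-distance of every point
    reach = np.maximum(k_dist[idx], dist)   # reachability distances
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)  # local reachability density
    return lrd[idx].mean(axis=1) / lrd

# Usage: one Gaussian cluster plus a single distant point (synthetic data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(200, 50)), np.full((1, 50), 10.0)])
idx, dist = pinn_knn(X, k=10, proj_dim=10, c=3)
scores = lof(idx, dist)
```

In this sketch a larger `c` trades speed for accuracy: when `c * k` reaches the size of the data set, the candidate set contains every other point and the result is the exact k-nearest-neighbour list.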

Keywords

approximation theory; data analysis; data mining; random processes; set theory; anomaly detection; approximation; dimensionality reduction; k-nearest neighbour; local outlier factor; projection indexed nearest-neighbour; random projection; subquadratic indexing

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Mining (ICDM), 2010 IEEE 10th International Conference on

Conference_Location

Sydney, NSW

ISSN

1550-4786

Print_ISBN

978-1-4244-9131-5

Electronic_ISBN

1550-4786

Type

conf

DOI

10.1109/ICDM.2010.151

Filename

5693966