DocumentCode
2207431
Title
Finding Local Anomalies in Very High Dimensional Space
Author
De Vries, Timothy ; Chawla, Sanjay ; Houle, Michael E.
Author_Institution
Sch. of Inf. Technol., Univ. of Sydney, Sydney, NSW, Australia
fYear
2010
fDate
13-17 Dec. 2010
Firstpage
128
Lastpage
137
Abstract
Time, cost and energy efficiency are critical factors for many data analysis techniques when the size and dimensionality of the data are very large. We investigate the use of Local Outlier Factor (LOF) for data of this type, providing a motivating example from real-world data. We propose Projection-Indexed Nearest-Neighbours (PINN), a novel technique that exploits extended nearest-neighbour sets in a reduced-dimensional space to create an accurate approximation of k-nearest-neighbour distances, which are used as the core density measurement within LOF. The reduced dimensionality allows for efficient indexing that is sub-quadratic in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of Random Projection (RP) and PINN shows that we are able to preserve the density of the intrinsic manifold of the data set after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF for many high-dimensional real-world data sets of up to 300,000 elements and 102,600 dimensions.
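The pipeline outlined in the abstract (project, collect an extended neighbour set in the reduced space, then measure k-nearest-neighbour distances for LOF) can be illustrated with a minimal Python sketch. It is a reading of the abstract only, not the authors' implementation. The function names `pinn_knn` and `lof_scores`, the parameter `candidate_factor`, the Gaussian projection matrix, and the re-ranking of candidates by original-space distance are all assumptions. The brute-force search in the projected space is a stand-in for the spatial index the abstract refers to, so this sketch remains quadratic.

```python
import numpy as np

def pinn_knn(X, k, proj_dim, candidate_factor=3, seed=0):
    """Approximate k-NN via a random projection (hypothetical helper).

    Candidates come from an extended neighbour set of size
    candidate_factor * k found in the projected space; they are then
    re-ranked by distance in the original space (an assumption).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = min(n - 1, candidate_factor * k)  # size of the extended neighbour set

    # Gaussian random projection to proj_dim dimensions.
    R = rng.standard_normal((d, proj_dim)) / np.sqrt(proj_dim)
    Z = X @ R

    # Squared pairwise distances in the projected space.
    # Brute force here; an index structure would replace this step.
    sq = (Z ** 2).sum(axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    np.fill_diagonal(D, np.inf)  # exclude each point from its own neighbours
    cand = np.argpartition(D, m - 1, axis=1)[:, :m]

    # Distances to the candidates in the original space, then keep the k nearest.
    true_d = np.linalg.norm(X[:, None, :] - X[cand], axis=2)
    order = np.argsort(true_d, axis=1)[:, :k]
    idx = np.take_along_axis(cand, order, axis=1)
    dist = np.take_along_axis(true_d, order, axis=1)
    return idx, dist

def lof_scores(idx, dist):
    """Local Outlier Factor from k-NN indices and sorted k-NN distances."""
    k_dist = dist[:, -1]                    # k-distance of every point
    # reach-dist(p, o) = max(d(p, o), k-distance(o))
    reach = np.maximum(dist, k_dist[idx])
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)  # local reachability density
    return lrd[idx].mean(axis=1) / lrd
```

Scores near 1 indicate points whose local density matches that of their neighbours; clearly larger scores flag local anomalies. Usage would be `idx, dist = pinn_knn(X, k=10, proj_dim=20)` followed by `lof_scores(idx, dist)`, where `X` is an `(n, d)` array and the values of `k` and `proj_dim` are illustrative only.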
Keywords
approximation theory; data analysis; data mining; random processes; set theory; anomaly detection; approximation; dimensionality reduction; k-nearest neighbour; local outlier factor; projection indexed nearest-neighbour; random projection; subquadratic indexing
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining (ICDM), 2010 IEEE 10th International Conference on
Conference_Location
Sydney, NSW
ISSN
1550-4786
Print_ISBN
978-1-4244-9131-5
Electronic_ISBN
1550-4786
Type
conf
DOI
10.1109/ICDM.2010.151
Filename
5693966
Link To Document