• DocumentCode
    2207431
  • Title
    Finding Local Anomalies in Very High Dimensional Space
  • Author
    De Vries, Timothy ; Chawla, Sanjay ; Houle, Michael E.
  • Author_Institution
    Sch. of Inf. Technol., Univ. of Sydney, Sydney, NSW, Australia
  • fYear
    2010
  • fDate
    13-17 Dec. 2010
  • Firstpage
    128
  • Lastpage
    137
  • Abstract
    Time, cost and energy efficiency are critical factors for many data analysis techniques when the size and dimensionality of the data are very large. We investigate the use of Local Outlier Factor (LOF) for data of this type, providing a motivating example from real-world data. We propose Projection-Indexed Nearest-Neighbours (PINN), a novel technique that exploits extended nearest-neighbour sets in a reduced-dimensional space to create an accurate approximation for k-nearest-neighbour distances, which is used as the core density measurement within LOF. The reduced dimensionality allows for efficient sub-quadratic indexing in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of Random Projection (RP) and PINN shows that we are able to preserve the density of the intrinsic manifold of the data set after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF for many high-dimensional real-world data sets of up to 300000 elements and 102600 dimensions.
  • Keywords
    approximation theory; data analysis; data mining; random processes; set theory; anomaly detection; approximation; dimensionality reduction; k-nearest neighbour; local outlier factor; projection indexed nearest-neighbour; random projection; subquadratic indexing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining (ICDM), 2010 IEEE 10th International Conference on
  • Conference_Location
    Sydney, NSW
  • ISSN
    1550-4786
  • Print_ISBN
    978-1-4244-9131-5
  • Electronic_ISBN
    1550-4786
  • Type
    conf
  • DOI
    10.1109/ICDM.2010.151
  • Filename
    5693966
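
The abstract describes PINN only at a high level: project the data to a lower dimension, gather an extended nearest-neighbour set there, and use it to approximate k-nearest-neighbour distances in the original space. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation. The function name `pinn_knn_distances`, the parameters `proj_dim` and `candidates`, the Gaussian projection matrix, and the brute-force search in the projected space (standing in for a spatial index) are all hypothetical choices made for illustration.

```python
import numpy as np


def pinn_knn_distances(X, k, proj_dim, candidates, seed=0):
    """Approximate each point's k-NN distance (illustrative sketch only).

    Assumes k <= candidates <= n - 1, where n is the number of rows of X.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Assumed Gaussian random projection down to proj_dim dimensions.
    R = rng.standard_normal((d, proj_dim)) / np.sqrt(proj_dim)
    Y = X @ R

    kdist = np.empty(n)
    for i in range(n):
        # Brute-force search in the projected space; a spatial index
        # would take the place of this step to avoid quadratic cost.
        proj_d = np.linalg.norm(Y - Y[i], axis=1)
        proj_d[i] = np.inf  # exclude the query point itself
        cand = np.argpartition(proj_d, candidates - 1)[:candidates]
        # Re-rank the extended candidate set by original-space distance.
        true_d = np.linalg.norm(X[cand] - X[i], axis=1)
        kdist[i] = np.partition(true_d, k - 1)[k - 1]
    return kdist


if __name__ == "__main__":
    data = np.random.default_rng(1).standard_normal((200, 500))
    print(pinn_knn_distances(data, k=5, proj_dim=10, candidates=20)[:3])
```

In this sketch the returned value can never be smaller than the exact k-NN distance, because it is the k-th smallest distance over a subset of the points; a larger `candidates` value tightens the approximation at higher cost. The resulting k-distances would then feed the density estimate inside LOF, which is not implemented here.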