  • DocumentCode
    848600
  • Title
    On the Design and Applicability of Distance Functions in High-Dimensional Data Space
  • Author
    Hsu, Chih-Ming ; Chen, Ming-Syan
  • Author_Institution
    Dept. of Electr. Eng., Nat. Taiwan Univ., Taipei
  • Volume
    21
  • Issue
    4
  • fYear
    2009
  • fDate
    4/1/2009
  • Firstpage
    523
  • Lastpage
    536
  • Abstract
    Effective distance functions in high-dimensional data space are essential to the solutions of many data mining problems. Recent research has shown that if the Pearson variation of the distance distribution converges to zero with increasing dimensionality, the distance function becomes unstable (or meaningless) in high-dimensional space, even with the commonly used Lp metric in Euclidean space. This result has spawned many studies along the same lines. However, the necessary condition for instability of a distance function, which is required for function design, remains unknown. In this paper, we prove that several important conditions are in fact equivalent to instability. Based on these theoretical results, we employ some effective and valid indices for testing the stability of a distance function. In addition, the theoretical analysis suggests that unstable phenomena are rooted in the variation of the distance distribution. To demonstrate the theoretical results, we design a meaningful distance function, called the shrinkage-divergence proximity (SDP), based on a given distance function. It is shown empirically that the SDP significantly outperforms other measures in terms of stability in high-dimensional data space and is thus more suitable for distance-based clustering applications.
  • Keywords
    data mining; query processing; Pearson variation; distance distribution; distance functions; high-dimensional data space; query; second moment coefficient; shrinkage-divergence proximity; Classification algorithms; Clustering algorithms; Degradation; Extraterrestrial measurements; Indexing; Nearest neighbor searches; Stability; Sufficient conditions; Testing; Clustering; Feature extraction or construction
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2008.178
  • Filename
    4609382
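
The instability criterion mentioned in the abstract can be illustrated with a small numerical sketch. This is not the authors' code or their SDP function. It assumes that "Pearson variation" means the ratio of the standard deviation to the mean of the distances from a query point, and it assumes data drawn i.i.d. from a uniform distribution on the unit cube. The names `lp_distance` and `pearson_variation` are hypothetical and chosen only for this example.

```python
import random
import statistics


def lp_distance(x, y, p=2.0):
    """Lp distance between two equal-length sequences of floats."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def pearson_variation(dim, n_points=200, p=2.0, seed=0):
    """Std/mean of Lp distances from one random query to n_points random points.

    Assumption: all coordinates are i.i.d. uniform on [0, 1).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        lp_distance(query, [rng.random() for _ in range(dim)], p)
        for _ in range(n_points)
    ]
    return statistics.pstdev(dists) / statistics.mean(dists)


# Under these assumptions the ratio shrinks as the dimensionality grows,
# which is the symptom the abstract associates with an unstable distance.
for dim in (2, 10, 100, 1000):
    print(dim, round(pearson_variation(dim), 4))
```

Under these assumptions the printed ratio decreases as `dim` grows, meaning the nearest and farthest points become hard to tell apart relative to the typical distance. Real data with correlated or clustered features may behave differently.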