• DocumentCode
    891288
  • Title

    The Concentration of Fractional Distances

  • Author

    François, Damien ; Wertz, Vincent ; Verleysen, Michel

  • Volume
    19
  • Issue
    7
  • fYear
    2007
  • fDate
    7/1/2007 12:00:00 AM
  • Firstpage
    873
  • Lastpage
    886
  • Abstract
    Nearest neighbor search and many other numerical data analysis tools most often rely on the use of the euclidean distance. When data are high dimensional, however, the euclidean distances seem to concentrate; all distances between pairs of data elements seem to be very similar. Therefore, the relevance of the euclidean distance has been questioned in the past, and fractional norms (Minkowski-like norms with an exponent less than one) were introduced to fight the concentration phenomenon. This paper justifies the use of alternative distances to fight concentration by showing that the concentration is indeed an intrinsic property of the distances and not an artifact from a finite sample. Furthermore, an estimation of the concentration as a function of the exponent of the distance and of the distribution of the data is given. It leads to the conclusion that, contrary to what is generally admitted, fractional norms are not always less concentrated than the euclidean norm; a counterexample is given to prove this claim. Theoretical arguments are presented, which show that the concentration phenomenon can appear for real data that do not match the hypotheses of the theorems, in particular, the assumption of independent and identically distributed variables. Finally, some insights about how to choose an optimal metric are given.
  • Keywords
    Books; Content based retrieval; DNA; Data analysis; Databases; Digital cameras; Euclidean distance; Extraterrestrial measurements; Information retrieval; Nearest neighbor searches; Nearest neighbor search; distance concentration; fractional distances.; high-dimensional data;
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    ieee
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2007.1037
  • Filename
    4216305