• DocumentCode
    245013
  • Title
    Mp-Dissimilarity: A Data Dependent Dissimilarity Measure
  • Author
    Aryal, Sunil; Ting, Kai Ming; Haffari, Gholamreza; Washio, Takashi
  • Author_Institution
    Clayton Sch. of Inf. Technol., Monash Univ., Melbourne, VIC, Australia
  • fYear
    2014
  • fDate
    14-17 Dec. 2014
  • Firstpage
    707
  • Lastpage
    712
  • Abstract
    Nearest neighbour search is a core process in many data mining algorithms. Finding reliable closest matches of a query in a high dimensional space is still a challenging task. This is because the effectiveness of many dissimilarity measures that are based on a geometric model, such as the lp-norm, decreases as the number of dimensions increases. In this paper, we examine how the data distribution can be exploited to measure dissimilarity between two instances and propose a new data dependent dissimilarity measure called 'mp-dissimilarity'. Rather than relying on geometric distance, it measures the dissimilarity between two instances in each dimension as the probability mass in a region that encloses the two instances. It deems two instances in a sparse region to be more similar than two instances in a dense region, even though the two pairs of instances have the same geometric distance. Our empirical results show that the proposed dissimilarity measure indeed provides a reliable nearest neighbour search in high dimensional spaces, particularly in sparse data. Mp-dissimilarity produced better task specific performance than lp-norm and cosine distance in classification and information retrieval tasks.
  • Keywords
    data mining; probability; query processing; search problems; cosine distance; data dependent dissimilarity measure; data distribution; data mining algorithms; geometric distance; geometric model; high dimensional space; information retrieval tasks; lp-norm; mp-dissimilarity measures; nearest neighbour search; probability mass; reliable nearest neighbour search; Accuracy; Approximation methods; Data mining; Educational institutions; Electronic mail; Information retrieval; Vectors; distance measure; lp-norm; mp-dissimilarity;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining (ICDM), 2014 IEEE International Conference on
  • Conference_Location
    Shenzhen
  • ISSN
    1550-4786
  • Print_ISBN
    978-1-4799-4303-6
  • Type
    conf
  • DOI
    10.1109/ICDM.2014.33
  • Filename
    7023388
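
The per-dimension probability-mass idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the enclosing region is defined or how the per-dimension masses are combined, so the sketch assumes the region is the closed interval between the two values in each dimension, estimates its mass as the fraction of data points falling inside it, and aggregates the masses with a power mean of order p (by analogy with the lp-norm). The names fit_columns and mp_dissimilarity are hypothetical.

```python
from bisect import bisect_left, bisect_right


def fit_columns(data):
    """Sort the values of each dimension once so range counts cost O(log n)."""
    return [sorted(col) for col in zip(*data)]


def mp_dissimilarity(x, y, sorted_cols, p=2.0):
    """Data dependent dissimilarity between x and y (illustrative sketch).

    Assumption: in each dimension the enclosing region is the closed
    interval [min(x_i, y_i), max(x_i, y_i)], and its probability mass is
    the fraction of data points whose value lies inside that interval.
    Assumption: the masses are combined with a power mean of order p.
    """
    n = len(sorted_cols[0])  # number of data points
    d = len(sorted_cols)     # number of dimensions
    total = 0.0
    for xi, yi, col in zip(x, y, sorted_cols):
        lo, hi = min(xi, yi), max(xi, yi)
        # Count the data points with lo <= value <= hi in this dimension.
        count = bisect_right(col, hi) - bisect_left(col, lo)
        total += (count / n) ** p
    return (total / d) ** (1.0 / p)


# Toy 1-D data set: six points packed into [0, 0.5], two points far away.
data = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [5.0], [5.5]]
cols = fit_columns(data)

# Both pairs are 0.5 apart geometrically, but lie in regions of different density.
dense = mp_dissimilarity([0.0], [0.5], cols)
sparse = mp_dissimilarity([5.0], [5.5], cols)
print(dense, sparse)
```

In this toy data set the pair in the sparse region encloses fewer data points than the pair in the dense region, so it receives the smaller dissimilarity even though both pairs have the same geometric distance, which matches the behaviour the abstract describes.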