• DocumentCode
    3248626
  • Title

    High performance data mining using the nearest neighbor join

  • Author

    Böhm, Christian ; Krebs, Florian

  • fYear
    2002
  • fDate
    2002
  • Firstpage
    43
  • Lastpage
    50
  • Abstract
    The similarity join has become an important database primitive to support similarity search and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Two types of similarity join are well known: the distance range join, where the user defines a distance threshold for the join, and the closest point query or k-distance join, which retrieves the k most similar pairs. In this paper, we investigate a third important similarity join operation, the k-nearest neighbor join, which combines each point of one point set with its k nearest neighbors in the other set. It has been shown that many standard algorithms of Knowledge Discovery in Databases (KDD), such as k-means and k-medoid clustering, nearest neighbor classification, data cleansing, and postprocessing of sampling-based data mining, can be implemented on top of the k-nn join operation to achieve performance improvements without affecting the quality of the result of these algorithms. We propose a new algorithm to compute the k-nearest neighbor join using the multipage index (MuX), a specialized index structure for the similarity join. To reduce both CPU and I/O cost, we develop optimal loading and processing strategies.
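
    The three join operations named in the abstract can be contrasted with a minimal brute-force sketch. This is not the MuX-based algorithm the paper proposes; it is a nested-loop illustration that assumes points are tuples of floats and Euclidean distance, and the function names (range_join, k_distance_join, knn_join) are hypothetical.

```python
import heapq
import math


def range_join(R, S, eps):
    """Distance range join: all pairs (r, s) with distance at most eps."""
    return [(r, s) for r in R for s in S if math.dist(r, s) <= eps]


def k_distance_join(R, S, k):
    """k-distance join: the k closest pairs over the whole cross product."""
    pairs = ((r, s) for r in R for s in S)
    return heapq.nsmallest(k, pairs, key=lambda rs: math.dist(rs[0], rs[1]))


def knn_join(R, S, k):
    """k-nn join: for each point r in R, its k nearest neighbors in S."""
    return {
        r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s))
        for r in R
    }


# Illustrative data only (not taken from the paper).
R = [(0.0, 0.0), (10.0, 10.0)]
S = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0), (20.0, 20.0)]

print(range_join(R, S, 1.5))
print(k_distance_join(R, S, 3))
print(knn_join(R, S, 2))
```

    The result sizes differ in kind: the range join returns however many pairs fall within eps, the k-distance join returns k pairs in total, and the k-nn join returns k partners for every point of R. The sketch evaluates every pair of points, which is the cost an index-based method aims to reduce.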
  • Keywords
    data mining; database theory; query processing; data mining; database primitive; multidimensional databases; multipage index; similarity join; similarity search; Acceleration; Biomedical informatics; Clustering algorithms; Cost function; Data analysis; Data mining; Databases; Multidimensional systems; Nearest neighbor searches; Performance gain;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on
  • Print_ISBN
    0-7695-1754-4
  • Type

    conf

  • DOI
    10.1109/ICDM.2002.1183884
  • Filename
    1183884