• DocumentCode
    3013072
  • Title
    A cost model and index architecture for the similarity join
  • Author
    Böhm, Christian ; Kriegel, Hans-Peter
  • Author_Institution
    Munich Univ., Germany
  • fYear
    2001
  • fDate
    2001
  • Firstpage
    411
  • Lastpage
    420
  • Abstract
    The similarity join is an important database primitive which has been successfully applied to speed up data mining algorithms. In the similarity join, two point sets of a multidimensional vector space are combined such that the result contains all point pairs where the distance does not exceed a parameter ε. Due to its high practical relevance, many similarity join algorithms have been devised. We propose an analytical cost model for the similarity join operation based on indexes. Our problem analysis reveals a serious optimization conflict between CPU time and I/O time: fine-grained index structures are beneficial for CPU efficiency, but deteriorate the I/O performance. As a consequence of this observation, we propose a new index architecture and join algorithm which allow a separate optimization of CPU time and I/O time. Our solution utilizes large pages which are optimized for I/O processing. The pages accommodate a search structure which minimizes the computational effort. In the experimental evaluation, a substantial improvement over competitive techniques is shown.
  • Keywords
    database indexing; optimisation; query processing; relational algebra; tree data structures; CPU efficiency; CPU time; I/O performance; I/O processing; I/O time; analytical cost model; competitive techniques; computational effort; cost model; data mining algorithms; database primitive; experimental evaluation; fine-grained index structures; index architecture; join algorithm; large pages; multidimensional vector space; optimization conflict; point pairs; point sets; practical relevance; problem analysis; search structure; similarity join; similarity join algorithms; similarity join operation; Algorithm design and analysis; Biomedical imaging; Clustering algorithms; Costs; Data mining; Image analysis; Multidimensional systems; Performance analysis; Spatial databases; Time series analysis;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2001. Proceedings. 17th International Conference on
  • Conference_Location
    Heidelberg
  • ISSN
    1063-6382
  • Print_ISBN
    0-7695-1001-9
  • Type
    conf
  • DOI
    10.1109/ICDE.2001.914854
  • Filename
    914854
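
The similarity join defined in the abstract (all point pairs from two sets whose distance does not exceed ε) can be illustrated with a minimal brute-force sketch. This is not the index-based algorithm the paper proposes; it is only a nested-loop baseline that makes the definition concrete. The function name `similarity_join`, the sample points, and the choice of Euclidean distance are assumptions made for illustration, since the abstract says only "distance".

```python
import math


def similarity_join(r, s, eps):
    """Brute-force similarity join (illustrative baseline only).

    Returns every pair (p, q) with p in r and q in s whose
    Euclidean distance is at most eps. Euclidean distance is an
    assumption; the abstract does not name a specific metric.
    """
    result = []
    for p in r:
        for q in s:
            # Keep the pair only if the two points are within eps.
            if math.dist(p, q) <= eps:
                result.append((p, q))
    return result


# Hypothetical sample data: two small 2-D point sets.
r = [(0.0, 0.0), (1.0, 1.0)]
s = [(0.0, 0.5), (3.0, 3.0)]
pairs = similarity_join(r, s, eps=1.0)
```

This baseline compares every point of one set with every point of the other, so its cost grows with the product of the set sizes. Index-based approaches such as the one the abstract describes aim to avoid most of these comparisons.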