  • DocumentCode
    2506817
  • Title
    Joining massive high-dimensional datasets
  • Author
    Kahveci, Tamer; Lang, Christian A.; Singh, Ambuj K.
  • Author_Institution
    Dept. of Comput. Sci., California Univ., Santa Barbara, CA, USA
  • fYear
    2003
  • fDate
    5-8 March 2003
  • Firstpage
    265
  • Lastpage
    276
  • Abstract
    We consider the problem of joining massive datasets. We propose two techniques for minimizing the disk I/O cost of join operations for both spatial and sequence data. Our techniques optimize the use of the available buffer space using a global view of the datasets. We build a Boolean matrix on the pages of the given datasets using a lower-bounding distance predictor. The marked entries of this matrix represent candidate page pairs to be joined. Our first technique joins the marked pages iteratively. Our second technique clusters the marked entries into rectangular dense regions that have minimal perimeter and fit into the buffer. These clusters are then ordered so that the total number of common pages between consecutive clusters is maximal. The clusters are then read from disk and joined. Our experimental results on various real datasets show that our techniques are 2 to 86 times faster than the competing techniques for spatial datasets, and 13 to 133 times faster than the competing techniques for sequence datasets.
  • Keywords
    Boolean algebra; matrix algebra; optimisation; statistical analysis; visual databases; boolean matrix; buffer space; candidate page pair; disk I/O cost minimization; high-dimensional dataset joining; join operation; lower bounding distance predictor; rectangular dense region; sequence data; spatial data; Bandwidth; Bioinformatics; Computer science; Costs; Current measurement; Euclidean distance; Genomics; Geographic Information Systems; Spatial databases; Stock markets;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2003. Proceedings. 19th International Conference on
  • Print_ISBN
    0-7803-7665-X
  • Type
    conf
  • DOI
    10.1109/ICDE.2003.1260798
  • Filename
    1260798
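
The candidate-matrix idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes pages are small lists of points, uses the distance between minimum bounding rectangles as the lower-bounding predictor (one common choice for spatial data, not confirmed by this record), and joins marked page pairs one at a time without any buffer management or clustering. The names `mbr`, `mbr_lower_bound`, `candidate_matrix`, and `join_marked` are hypothetical.

```python
from math import dist, sqrt


def mbr(page):
    """Minimum bounding rectangle of a page as (mins, maxs) per dimension."""
    dims = range(len(page[0]))
    mins = [min(p[d] for p in page) for d in dims]
    maxs = [max(p[d] for p in page) for d in dims]
    return mins, maxs


def mbr_lower_bound(a, b):
    """Smallest possible Euclidean distance between any point in box a
    and any point in box b (assumed predictor, not from the paper)."""
    (amin, amax), (bmin, bmax) = a, b
    total = 0.0
    for lo_a, hi_a, lo_b, hi_b in zip(amin, amax, bmin, bmax):
        gap = max(lo_b - hi_a, lo_a - hi_b, 0.0)  # 0 when intervals overlap
        total += gap * gap
    return sqrt(total)


def candidate_matrix(pages_r, pages_s, eps):
    """Boolean matrix: entry [i][j] is marked when pages i and j may
    contain a pair of points within distance eps."""
    boxes_r = [mbr(p) for p in pages_r]
    boxes_s = [mbr(p) for p in pages_s]
    return [[mbr_lower_bound(br, bs) <= eps for bs in boxes_s]
            for br in boxes_r]


def join_marked(pages_r, pages_s, eps):
    """Join only the marked page pairs, one pair at a time."""
    marked = candidate_matrix(pages_r, pages_s, eps)
    result = []
    for i, row in enumerate(marked):
        for j, is_candidate in enumerate(row):
            if is_candidate:
                result.extend((x, y)
                              for x in pages_r[i]
                              for y in pages_s[j]
                              if dist(x, y) <= eps)
    return result
```

Because the predictor never overestimates the true distance, an unmarked entry cannot hide a qualifying pair, so skipping those page pairs saves reads without losing results. The second technique in the abstract (clustering marked entries into buffer-sized rectangles and ordering the clusters) is not covered by this sketch.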