• DocumentCode
    843645
  • Title

    TAPER: a two-step approach for all-strong-pairs correlation query in large databases

  • Author

    Xiong, Hui ; Shekhar, Shashi ; Tan, Pang-Ning ; Kumar, Vipin

  • Author_Institution
    Manage. Sci. & Inf. Syst. Dept., Rutgers Univ., Newark, NJ, USA
  • Volume
    18
  • Issue
    4
  • fYear
    2006
  • fDate
    4/1/2006 12:00:00 AM
  • Firstpage
    493
  • Lastpage
    508
  • Abstract
    Given a user-specified minimum correlation threshold θ and a market-basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold θ. However, when the numbers of items and transactions are large, the computational cost of this query can be very high. The goal of this paper is to provide computationally efficient algorithms to answer the all-strong-pairs correlation query. To this end, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient, but also exhibits special monotone properties that allow many item pairs to be pruned without even computing their upper bounds. A two-step all-strong-pairs correlation query (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computational savings from pruning are independent of, or improve with, the number of items in data sets with Zipf-like or linear rank-support distributions. Experimental results from synthetic and real-world data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives. Finally, we demonstrate that the algorithmic ideas developed in the TAPER algorithm can be extended to efficiently compute negative correlation and the uncentered Pearson's correlation coefficient.
  • Keywords
    correlation methods; data mining; query processing; statistical analysis; transaction processing; very large databases; Pearson correlation coefficient; TAPER algorithm; algebraic cost model; item pair pruning; large databases; market-basket database; statistical computing; two-step all-strong-pairs correlation query algorithm; user-specified minimum correlation threshold; Computational efficiency; Costs; Data analysis; Distributed computing; Marketing and sales; Matrices; Public healthcare; Transaction databases; Upper bound; Association analysis; Pearson's correlation coefficient
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2006.1599388
  • Filename
    1599388
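
The filter-and-refine idea described in the Abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`phi_upper_bound`, `phi`, `strong_pairs`) and the list-of-lists transaction format are hypothetical, and a pair is kept when its correlation is at least θ. The abstract does not state the bound itself; the one used here follows from Pearson's correlation for two binary variables (the phi coefficient) by setting the joint support to its largest possible value, which is the smaller of the two item supports. With items sorted by decreasing support, that bound can only shrink as the second item becomes rarer, so the inner loop can stop early for any θ > 0.

```python
from math import sqrt


def phi_upper_bound(sa, sb):
    """Upper bound on phi for two items with supports sa and sb.

    Derived by assuming the joint support equals min(sa, sb).
    """
    hi, lo = max(sa, sb), min(sa, sb)
    if hi >= 1.0:
        # An item present in every transaction has zero variance;
        # treating its correlation as 0 is a convention of this sketch.
        return 0.0
    return sqrt(lo / hi) * sqrt((1.0 - hi) / (1.0 - lo))


def phi(sa, sb, sab):
    """Pearson's correlation for two binary variables (phi coefficient)."""
    denom = sqrt(sa * sb * (1.0 - sa) * (1.0 - sb))
    return (sab - sa * sb) / denom if denom > 0 else 0.0


def strong_pairs(transactions, theta):
    """Return (item_a, item_b, phi) for every pair with phi >= theta.

    Assumes theta > 0 so that early termination is safe.
    """
    n = len(transactions)
    if n == 0:
        return []

    # Map each item to the set of transaction indices that contain it.
    tids = {}
    for idx, transaction in enumerate(transactions):
        for item in set(transaction):
            tids.setdefault(item, set()).add(idx)
    supp = {item: len(ids) / n for item, ids in tids.items()}

    # Sort by decreasing support (item name breaks ties deterministically).
    items = sorted(supp, key=lambda it: (-supp[it], it))

    result = []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            # Filter step: cheap bound computed from single-item supports only.
            if phi_upper_bound(supp[a], supp[b]) < theta:
                # Every later b has lower support, hence a lower bound.
                break
            # Refine step: compute the exact correlation for surviving pairs.
            sab = len(tids[a] & tids[b]) / n
            r = phi(supp[a], supp[b], sab)
            if r >= theta:
                result.append((a, b, r))
    return result


if __name__ == "__main__":
    baskets = [["a", "b"], ["a", "b"], ["a", "b", "c"], ["c"], ["c", "d"], ["d"]]
    print(strong_pairs(baskets, theta=0.5))
```

The filter step touches only single-item supports, so the joint support, which is the expensive part, is computed only for pairs that survive the bound.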