Title :
Beyond independence: probabilistic models for query approximation on binary transaction data
Author :
Pavlov, Dmitry ; Mannila, Heikki ; Smyth, Padhraic
Author_Institution :
NEC Res. Inst., Princeton, NJ, USA
Abstract :
We investigate the problem of generating fast approximate answers to queries posed to large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy model and the itemset inclusion-exclusion model. In the maximum entropy model, we treat itemsets as constraints on the distribution of the query variables and use the maximum entropy principle to build a joint probability model for the query attributes online. In the inclusion-exclusion model, itemsets and their frequencies are stored in a data structure, called an ADtree, that supports an efficient implementation of the inclusion-exclusion principle in order to answer the query. We empirically compare these two itemset-based models to direct querying of the original data, querying of samples of the original data, as well as other probabilistic models such as the independence model, the Chow-Liu tree model, and the Bernoulli mixture model. These models are able to handle high dimensionality (hundreds or thousands of attributes), whereas most other work on this topic has focused on relatively low-dimensional OLAP problems. Experimental results on both simulated and real-world transaction data sets illustrate various fundamental trade-offs between approximation error, model complexity, and the online time required to compute a query answer.
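The contrast between the independence baseline and an inclusion-exclusion answer built from itemset frequencies can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes every required itemset frequency is available by scanning a small in-memory data set, and it uses neither an ADtree nor a support threshold. The function names (`itemset_frequency`, `inclusion_exclusion_query`, `independence_query`) and the toy transactions are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def itemset_frequency(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def inclusion_exclusion_query(freq, positive, negative):
    """P(all `positive` items present and all `negative` items absent).

    Uses the identity
        P(pos = 1, neg = 0) = sum over S subset of neg of (-1)^|S| * f(pos U S),
    where f is the itemset frequency function `freq`.
    """
    positive = frozenset(positive)
    negative = tuple(negative)
    total = 0.0
    for k in range(len(negative) + 1):
        for subset in combinations(negative, k):
            total += (-1) ** k * freq(positive | frozenset(subset))
    return total


def independence_query(freq, positive, negative):
    """Baseline: multiply single-item marginals as if items were independent."""
    p = 1.0
    for item in positive:
        p *= freq({item})
    for item in negative:
        p *= 1.0 - freq({item})
    return p


# Toy data set (an assumption for illustration): 8 transactions over items a, b, c.
transactions = [
    frozenset(t)
    for t in (
        {"a", "b"}, {"a"}, {"a", "b", "c"}, {"c"},
        {"b"}, {"a", "c"}, set(), {"a", "b"},
    )
]


def freq(itemset):
    return itemset_frequency(transactions, itemset)


# Query: a is present and b is absent.
exact = inclusion_exclusion_query(freq, {"a"}, {"b"})
approx = independence_query(freq, {"a"}, {"b"})
print(exact, approx)  # prints "0.25 0.3125"
```

Here the inclusion-exclusion answer, f({a}) - f({a, b}) = 5/8 - 3/8, matches a direct count of the two qualifying transactions, while the independence model overestimates because items a and b co-occur more often than independence would predict. The number of terms grows exponentially with the number of negated items, and when only frequent itemsets are stored some terms may be missing, which is one reason the trade-off between accuracy and online time matters.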
Keywords :
computational complexity; maximum entropy methods; probability; query processing; transaction processing; tree data structures; very large databases; ADtree; Bernoulli mixture model; Chow-Liu tree model; approximation error; binary transaction data; data structure; fast approximate answer generation; frequent itemsets; high-dimensionality problems; itemset inclusion-exclusion model; itemset maximum entropy model; joint probability model; large sparse binary data sets; model complexity; online time; probabilistic models; query approximation; query variable distribution constraints; transaction data sets; Approximation error; Computational modeling; Data analysis; Data mining; Data structures; Database systems; Entropy; Frequency estimation; Itemsets
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
DOI :
10.1109/TKDE.2003.1245281