DocumentCode
827130
Title
Beyond independence: probabilistic models for query approximation on binary transaction data
Author
Pavlov, Dmitry ; Mannila, Heikki ; Smyth, Padhraic
Author_Institution
NEC Res. Inst., Princeton, NJ, USA
Volume
15
Issue
6
fYear
2003
Firstpage
1409
Lastpage
1421
Abstract
We investigate the problem of generating fast approximate answers to queries posed to large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy model and the itemset inclusion-exclusion model. In the maximum entropy model, we treat itemsets as constraints on the distribution of the query variables and use the maximum entropy principle to build a joint probability model for the query attributes online. In the inclusion-exclusion model, itemsets and their frequencies are stored in a data structure, called an ADtree, that supports an efficient implementation of the inclusion-exclusion principle in order to answer the query. We empirically compare these two itemset-based models to direct querying of the original data, querying of samples of the original data, as well as other probabilistic models such as the independence model, the Chow-Liu tree model, and the Bernoulli mixture model. These models are able to handle high dimensionality (hundreds or thousands of attributes), whereas most other work on this topic has focused on relatively low-dimensional OLAP problems. Experimental results on both simulated and real-world transaction data sets illustrate various fundamental trade-offs between approximation error, model complexity, and the online time required to compute a query answer.
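As a rough illustration of the ideas named in the abstract, the sketch below answers one conjunctive query on a toy binary data set in three ways: by scanning the data directly, with the baseline independence model, and with the inclusion-exclusion principle over itemset frequencies. It is a minimal sketch, not the authors' implementation: the toy data and the function names `freq`, `direct`, `independence`, and `inclusion_exclusion` are assumptions made for illustration, and `freq` scans the data where the paper would look up stored itemset frequencies in an ADtree.

```python
from itertools import combinations

# Hypothetical toy data: each transaction is the set of items whose bit is 1.
transactions = [
    {"a", "b"},
    {"a", "b", "c"},
    {"a"},
    {"b", "c"},
    {"a", "b"},
    {"c"},
    {"a", "b", "c"},
    {"b"},
]
n = len(transactions)


def freq(itemset):
    """Fraction of transactions containing every item in `itemset`.

    Stands in for a lookup of stored itemset frequencies (assumption).
    """
    return sum(1 for t in transactions if itemset <= t) / n


def direct(pos, neg):
    """Exact answer: fraction with all `pos` items present and all `neg` absent."""
    return sum(1 for t in transactions if pos <= t and not (neg & t)) / n


def independence(pos, neg):
    """Baseline: treat every attribute as independent and multiply marginals."""
    p = 1.0
    for item in pos:
        p *= freq({item})
    for item in neg:
        p *= 1.0 - freq({item})
    return p


def inclusion_exclusion(pos, neg):
    """P(pos present, neg absent) as an alternating sum of itemset frequencies."""
    total = 0.0
    neg_items = sorted(neg)
    for k in range(len(neg_items) + 1):
        for subset in combinations(neg_items, k):
            total += (-1) ** k * freq(pos | set(subset))
    return total


# Query: a = 1, b = 1, c = 0
pos, neg = {"a", "b"}, {"c"}
print("direct:             ", direct(pos, neg))
print("independence:       ", independence(pos, neg))
print("inclusion-exclusion:", inclusion_exclusion(pos, neg))
```

When every itemset frequency the sum needs is available, inclusion-exclusion reproduces the direct answer, while the independence model only approximates it. The number of terms doubles with each negated attribute, which is one source of the trade-off between accuracy and online time that the abstract mentions.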
Keywords
computational complexity; maximum entropy methods; probability; query processing; transaction processing; tree data structures; very large databases; ADtree; Bernoulli mixture model; Chow-Liu tree model; approximation error; binary transaction data; data structure; fast approximate answer generation; frequent itemsets; high-dimensionality problems; itemset inclusion-exclusion model; itemset maximum entropy model; joint probability model; large sparse binary data sets; model complexity; online time; probabilistic models; query approximation; query variable distribution constraints; transaction data sets; Approximation error; Computational modeling; Data analysis; Data mining; Data structures; Database systems; Entropy; Frequency estimation; Itemsets;
fLanguage
English
Journal_Title
Knowledge and Data Engineering, IEEE Transactions on
Publisher
IEEE
ISSN
1041-4347
Type
jour
DOI
10.1109/TKDE.2003.1245281
Filename
1245281