Regional Information Center for Science and Technology - Beyond independence: probabilistic models for query approximation on binary transaction data

DocumentCode :

827130

Title :

Beyond independence: probabilistic models for query approximation on binary transaction data

Author :

Pavlov, Dmitry ; Mannila, Heikki ; Smyth, Padhraic

Author_Institution :

NEC Res. Inst., Princeton, NJ, USA

Volume :

Issue :

fYear :

2003

Firstpage :

1409

Lastpage :

1421

Abstract :

We investigate the problem of generating fast approximate answers to queries posed to large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy model and the itemset inclusion-exclusion model. In the maximum entropy model, we treat itemsets as constraints on the distribution of the query variables and use the maximum entropy principle to build a joint probability model for the query attributes online. In the inclusion-exclusion model, itemsets and their frequencies are stored in a data structure, called an ADtree, that supports an efficient implementation of the inclusion-exclusion principle in order to answer the query. We empirically compare these two itemset-based models to direct querying of the original data, querying of samples of the original data, as well as other probabilistic models such as the independence model, the Chow-Liu tree model, and the Bernoulli mixture model. These models are able to handle high dimensionality (hundreds or thousands of attributes), whereas most other work on this topic has focused on relatively low-dimensional OLAP problems. Experimental results on both simulated and real-world transaction data sets illustrate various fundamental trade-offs between approximation error, model complexity, and the online time required to compute a query answer.
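To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It compares a baseline independence estimate with an inclusion-exclusion answer for a conjunctive query on a toy binary data set. The data, the function names, and the choice to compute itemset frequencies directly from the raw transactions are all assumptions made for illustration; the paper stores frequent itemsets in an ADtree, which this sketch does not attempt to reproduce.

```python
from itertools import combinations

# Hypothetical toy data: each transaction is the set of items present (value 1).
transactions = [
    {"a", "b"},
    {"a", "b", "c"},
    {"a"},
    {"b", "c"},
    {"c"},
    {"a", "b"},
    set(),
    {"a", "c"},
]


def frequency(itemset, data):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in data if itemset <= t) / len(data)


def exact_query(positive, negative, data):
    """Exact answer: items in `positive` present, items in `negative` absent."""
    hits = sum(1 for t in data if positive <= t and not (negative & t))
    return hits / len(data)


def independence_estimate(positive, negative, data):
    """Baseline: multiply per-item marginals as if items were independent."""
    p = 1.0
    for item in positive:
        p *= frequency({item}, data)
    for item in negative:
        p *= 1.0 - frequency({item}, data)
    return p


def inclusion_exclusion(positive, negative, data):
    """Answer the query using only itemset frequencies.

    P(positive present, negative absent) is the alternating sum over
    subsets S of `negative` of (-1)^|S| * frequency(positive | S).
    """
    total = 0.0
    neg = sorted(negative)
    for k in range(len(neg) + 1):
        for subset in combinations(neg, k):
            total += (-1) ** k * frequency(positive | set(subset), data)
    return total


query_pos, query_neg = {"a", "b"}, {"c"}
exact = exact_query(query_pos, query_neg, transactions)
indep = independence_estimate(query_pos, query_neg, transactions)
incl_excl = inclusion_exclusion(query_pos, query_neg, transactions)
print(exact, indep, incl_excl)
```

On this toy data the inclusion-exclusion sum matches the exact answer because every required itemset frequency is available, while the independence estimate differs because items "a" and "b" co-occur more often than their marginals suggest. When only frequent itemsets are stored, some terms of the sum would be missing and the answer would become approximate.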

Keywords :

computational complexity; maximum entropy methods; probability; query processing; transaction processing; tree data structures; very large databases; ADtree; Bernoulli mixture model; Chow-Liu tree model; approximation error; binary transaction data; data structure; fast approximate answer generation; frequent itemsets; high-dimensionality problems; itemset inclusion-exclusion model; itemset maximum entropy model; joint probability model; large sparse binary data sets; model complexity; online time; probabilistic models; query approximation; query variable distribution constraints; transaction data sets; Approximation error; Computational modeling; Data analysis; Data mining; Data structures; Database systems; Entropy; Frequency estimation; Itemsets;

fLanguage :

English

Journal_Title :

Knowledge and Data Engineering, IEEE Transactions on

Publisher :

IEEE

ISSN :

1041-4347

Type :

jour

DOI :

10.1109/TKDE.2003.1245281

Filename :

1245281

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=827130